Abstract


Sparse additive models are families of $d$-variate functions that have the additive decomposition
$f^* = \sum_{j \in S} f_j^*$, where $S$ is an unknown subset of cardinality $s \ll d$. We consider the case
where each component function $f_j^*$ lies in a reproducing kernel Hilbert space, and analyze a
simple kernel-based convex program for estimating the unknown function $f^*$. Working within
a high-dimensional framework that allows both the dimension $d$ and sparsity $s$ to scale, we
derive convergence rates in the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norms. These rates consist of two terms: a
subset selection term of the order $\frac{s \log d}{n}$, corresponding to the difficulty of finding the unknown
$s$-sized subset, and an estimation error term of the order $s\,\nu_n^2$, where $\nu_n^2$ is the optimal rate for
estimating a univariate function within the RKHS. We complement these achievable results by
deriving minimax lower bounds on the $L^2(\mathbb{P})$ error, thereby showing that our method is optimal
up to constant factors for sublinear sparsity $s = o(d)$. Thus, we obtain optimal minimax rates
for many interesting classes of sparse additive models, including polynomials, splines, finite-rank
kernel classes, as well as Sobolev smoothness classes.

1  Introduction 

The  past  decade  has  witnessed  a  flurry  of  research  on  sparsity  constraints  in  statistical  models. 
Sparsity  is  an  attractive  assumption  for  both  practical  and  theoretical  reasons:  it  leads  to  more 
interpretable  models,  reduces  computational  cost,  and  allows  for  model  identifiability  even  under 
high-dimensional  scaling,  where  the  dimension  d  exceeds  the  sample  size  n.  While  a  large  body 
of  work  has  focused  on  sparse  linear  models,  many  applications  call  for  the  additional  flexibility 
provided  by  non-parametric  models.  In  the  general  setting,  a  non-parametric  regression  model 
takes the form $y = f^*(x_1, \ldots, x_d) + w$, where $f^* : \mathbb{R}^d \to \mathbb{R}$ is the unknown regression function,
and $w$ is scalar observation noise. Unfortunately, this general non-parametric model is known to
suffer severely from the so-called "curse of dimensionality", in that for most natural function classes
(e.g.,  twice  differentiable  functions),  the  sample  size  n  required  to  achieve  any  given  error  grows 
exponentially  in  the  dimension  d. 

Given  this  curse  of  dimensionality,  it  is  essential  to  further  limit  the  complexity  of  possible 
functions $f^*$. One attractive candidate is the class of additive non-parametric models [15], in
which the function $f^*$ has an additive decomposition of the form

$$f^*(x_1, x_2, \ldots, x_d) = \sum_{j=1}^{d} f_j^*(x_j), \qquad (1)$$

where each component function $f_j^*$ is univariate. Given this decoupling, this function class no
longer suffers from the exponential explosion in sample size of the general non-parametric model.
Nonetheless, one still requires a sample size $n > d$ for consistent estimation; note that this is true
even for the linear model, which is a special case of equation (1).

A  natural  extension  is  the  class  of  sparse  additive  models,  in  which  the  unknown  regression 
function  is  assumed  to  have  a  decomposition  of  the  form 

$$f^*(x_1, x_2, \ldots, x_d) = \sum_{j \in S} f_j^*(x_j), \qquad (2)$$

where $S \subset \{1, 2, \ldots, d\}$ is some unknown subset of cardinality $|S| = s$. Of primary interest is the
case when the decomposition is genuinely sparse, so that $s \ll d$. To the best of our knowledge,
this  model  class  was  first  introduced  in  Lin  and  Zhang  [21],  and  has  since  been  studied  by  various 
researchers  (e.g.,  [17,  23,  28,  37]).  Note  that  the  sparse  additive  model  (2)  is  a  natural  generalization 
of  the  sparse  linear  model,  to  which  it  reduces  when  each  univariate  function  is  constrained  to  be 
linear. 

In  past  work,  several  groups  have  proposed  computationally  efficient  methods  for  estimating 
sparse additive models (2). Just as $\ell_1$-based relaxations such as the Lasso have desirable properties
for sparse parametric models, more general $\ell_1$-based approaches have proven to be successful in
this setting. Lin and Zhang [21] proposed the COSSO method, which extends the Lasso to cases
where the component functions $f_j^*$ lie in a reproducing kernel Hilbert space (RKHS); see also
Yuan [37] for a similar extension of the non-negative garrote [7]. Bach [3] analyzes a closely related
method for the RKHS setting, in which least-squares loss is penalized by an $\ell_1$-sum of Hilbert
norms, and establishes consistency results in the classical (fixed $d$) setting. Other related $\ell_1$-based
methods have been proposed in independent work by Koltchinskii and Yuan [17], Ravikumar et
al. [28] and Meier et al. [23], and analyzed under high-dimensional scaling. As we describe in more
detail in Section 3.3, each of the above papers establishes consistency and convergence rates for the
prediction  error  under  certain  conditions  on  the  covariates  as  well  as  the  sparsity  s  and  dimension 
d.  However,  it  is  not  clear  whether  the  rates  obtained  in  these  papers  are  sharp  for  the  given 
methods,  nor  whether  the  rates  are  minimax-optimal. 

This  paper  makes  two  main  contributions  to  this  on-going  line  of  research.  Our  first  contribution 
is  to  analyze  a  simple  polynomial-time  method  for  estimating  sparse  additive  models  and  provide 
upper bounds on the error in both the $L^2(\mathbb{P}_n)$ and $L^2(\mathbb{P})$ norms. Our method is based on a
combination of least-squares loss with two $\ell_1$-based sparsity penalty terms, one corresponding to
an $\ell_1/L^2(\mathbb{P}_n)$ norm and the other an $\ell_1/\|\cdot\|_{\mathcal{H}}$ norm. This combination yields a second-order cone
program, for which solutions can be computed in polynomial time using interior-point methods
(see §4 and §11 in Boyd and Vandenberghe [6] for details). Although closely related to the methods
considered in past work [3, 17, 23, 28], our estimator differs in the particular form of regularization,
and we suspect that these differences are important in obtaining optimal convergence rates. Our
first main result (Theorem 1) shows that with high probability, the error of our procedure,
in either the squared $L^2(\mathbb{P}_n)$ or $L^2(\mathbb{P})$ norms, is bounded by $O\big(\frac{s \log d}{n} + s\nu_n^2\big)$. Each of these two
terms has a natural interpretation. The quantity $\frac{s \log d}{n}$ is a subset selection term, which reflects
the difficulty of extracting the $s$-sized subset of active functions from the total $d$. On the other
hand, the quantity $\nu_n^2$ corresponds to the optimal rate for estimating a single univariate function, so
that $s\nu_n^2$ corresponds to the $s$-dimensional estimation error associated with estimating $s$ univariate
functions. This latter term depends on the sparsity $s$ but not on the ambient dimension $d$. In order
to illustrate these rates more concretely, we discuss two particular consequences of Theorem 1.
First, Corollary 1 applies to parametric function classes and $m$-rank kernel classes, where $\nu_n^2 \sim \frac{m}{n}$.
Second, Corollary 2 applies to various types of non-parametric classes, among them Sobolev spaces,
where $\nu_n^2 \sim n^{-2\alpha/(2\alpha+1)}$ for some $\alpha > 1/2$.

Our  second  contribution  is  complementary  in  nature,  in  that  it  establishes  lower  bounds  that 
hold  uniformly  over  all  algorithms.  These  minimax  lower  bounds,  stated  in  Theorem  2,  are  specified 
in  terms  of  the  metric  entropy  of  the  underlying  univariate  function  classes.  For  both  finite-rank 
kernel  classes  and  Sobolev-type  classes,  these  lower  bounds  match  our  achievable  results,  as  stated 
in Corollaries 1 and 2, up to constant factors in the regime of sublinear sparsity ($s = o(d)$). Thus,
for  these  function  classes,  we  have  a  sharp  characterization  of  the  associated  minimax  rates.  The 
proofs  of  these  results  are  based  on  characterizing  the  packing  entropies  of  the  class  of  sparse 
additive  models,  combined  with  the  use  of  the  Fano  method. 

The lower bounds derived in this paper initially appeared in the Proceedings of the NIPS
Conference (December 2009). As we were completing this write-up, we became aware of concurrent
work by Koltchinskii and Yuan [18] (hereafter KY) that analyzes essentially the same estimator as
that used to prove upper bounds in this paper. As with our analysis, they assume that the unit ball
of each univariate Hilbert class $\mathcal{H}_j$, for $j = 1, \ldots, d$, is bounded. Under this assumption, they derive
a result (Theorem 3 in their paper) that contains the two terms involved in our Theorem 1, but also
includes additional pre-factors that depend on a global bound on the function class, namely the
quantity $C(\mathcal{F}_{d,s}) = \sup_{f \in \mathcal{F}_{d,s}} \|f\|_\infty$, where $\mathcal{F}_{d,s}$ is the class of $s$-sparse additive models in $d$
dimensions. Our result (Theorem 1 in our paper) requires only that each univariate function is bounded,
which is much less restrictive than global boundedness. If the quantity $C(\mathcal{F}_{d,s})$ remains bounded
independently of the dimension $d$ and sparsity $s$, then their result matches our rate up to constant
factors. On the other hand, if $C(\mathcal{F}_{d,s})$ scales with $(d, s)$, then our bound, which has no dependence
on this quantity, is tighter. It is worth noting that the condition $C(\mathcal{F}_{d,s}) = O(1)$, an assumption
that might seem innocuous at first sight, can be fairly restrictive for sparse additive models under
the high-dimensional scaling $(d, s) \to +\infty$. If a global boundedness condition is imposed, the rates
are not minimax-optimal in general; for instance, see Example 1 in Section 3.3. In addition to
global boundedness, there are other differences between the two papers. For instance, they analyze
a slightly more general class of quadratic-type losses, as opposed to the least-squares loss considered
here, and their analysis involves directly imposing RIP conditions on fixed design matrices, whereas
we consider the case of random design with independent co-ordinates (our results also hold,
albeit with slightly worse constants, if we impose RIP conditions instead of independence).

The  remainder  of  the  paper  is  organized  as  follows.  In  Section  2,  we  provide  background  on 
kernel  spaces  and  the  class  of  sparse  additive  models  considered  in  this  paper.  Section  3  is  devoted 
to  the  statement  of  our  main  results  and  discussion  of  their  consequences;  it  includes  description  of 
our  method,  the  convergence  rates  that  it  achieves,  and  a  matching  set  of  minimax  lower  bounds. 
Section 4 is devoted to the proofs of our upper and lower bounds, presented in Sections 4.1 and
4.2 respectively, with the more technical details deferred to the Appendices. We conclude
with  a  discussion  in  Section  5. 


2  Background  and  problem  set-up 

We  begin  with  some  background  on  reproducing  kernel  Hilbert  spaces,  before  providing  a  precise 
definition  of  the  class  of  sparse  additive  models  studied  in  this  paper. 




2.1  Reproducing  kernel  Hilbert  spaces 

Given a subset $\mathcal{X} \subset \mathbb{R}$ and a probability measure $\mathbb{Q}$ on $\mathcal{X}$, we consider a Hilbert space $\mathcal{H} \subset L^2(\mathbb{Q})$,
meaning a family of functions $g : \mathcal{X} \to \mathbb{R}$, with $\|g\|_{L^2(\mathbb{Q})} < \infty$, and an associated inner product
$\langle \cdot, \cdot \rangle_{\mathcal{H}}$ under which $\mathcal{H}$ is complete. The space $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) if
there exists a symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ such that: (a) for each $x \in \mathcal{X}$, the function
$K(\cdot, x)$ belongs to the Hilbert space $\mathcal{H}$, and (b) we have the reproducing relation $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$
for all $f \in \mathcal{H}$. Any such kernel function must be positive semidefinite; under suitable regularity
conditions, Mercer's theorem [25] guarantees that the kernel has an eigen-expansion of the form

$$K(x, x') = \sum_{\ell=1}^{\infty} \mu_\ell\, \phi_\ell(x)\, \phi_\ell(x'), \qquad (3)$$

where $\mu_1 \ge \mu_2 \ge \mu_3 \ge \ldots \ge 0$ are a non-negative sequence of eigenvalues, and $\{\phi_\ell\}_{\ell=1}^{\infty}$ are the
associated eigenfunctions, taken to be orthonormal in $L^2(\mathbb{Q})$. The decay rate of these eigenvalues
will play a crucial role in our analysis, since it ultimately determines the rate $\nu_n$ for the univariate
RKHSs in our function classes.

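Since the eigenfunctions $\{\phi_\ell\}_{\ell=1}^{\infty}$ form an orthonormal basis, any function $f \in \mathcal{H}$ has an
expansion of the form $f(x) = \sum_{\ell=1}^{\infty} a_\ell \phi_\ell(x)$, where $a_\ell = \langle f, \phi_\ell \rangle_{L^2(\mathbb{Q})} = \int_{\mathcal{X}} f(x) \phi_\ell(x)\, d\mathbb{Q}(x)$ are
(generalized) Fourier coefficients. Associated with any two functions in $\mathcal{H}$, say $f = \sum_{\ell=1}^{\infty} a_\ell \phi_\ell$ and
$g = \sum_{\ell=1}^{\infty} b_\ell \phi_\ell$, are two distinct inner products. The first is the usual inner product in the space
$L^2(\mathbb{Q})$, namely $\langle f, g \rangle_{L^2(\mathbb{Q})} := \int_{\mathcal{X}} f(x) g(x)\, d\mathbb{Q}(x)$. By Parseval's theorem, it has an equivalent
representation in terms of the expansion coefficients, namely

$$\langle f, g \rangle_{L^2(\mathbb{Q})} = \sum_{\ell=1}^{\infty} a_\ell b_\ell.$$

The second inner product, denoted $\langle f, g \rangle_{\mathcal{H}}$, is the one that defines the Hilbert space; it can be
written in terms of the kernel eigenvalues and generalized Fourier coefficients as

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{a_\ell b_\ell}{\mu_\ell}.$$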

For  more  background  on  reproducing  kernel  Hilbert  spaces,  we  refer  the  reader  to  various  standard 
references  [2,  29,  30,  34,  12]. 
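As a small numerical illustration of the two inner products, the sketch below represents each function by finitely many generalized Fourier coefficients. The function names and the truncation to a finite coefficient list are assumptions made for illustration only.

```python
def l2_inner(a, b):
    """L2(Q) inner product via Parseval: sum over l of a_l * b_l."""
    return sum(x * y for x, y in zip(a, b))


def rkhs_inner(a, b, mu):
    """Hilbert inner product: sum over l of a_l * b_l / mu_l.

    Coefficients attached to a zero eigenvalue must vanish, otherwise the
    function does not belong to the Hilbert space.
    """
    total = 0.0
    for x, y, m in zip(a, b, mu):
        if m > 0:
            total += x * y / m
        elif x != 0 or y != 0:
            raise ValueError("nonzero coefficient on a zero eigenvalue")
    return total


# Hypothetical coefficients and eigenvalues, chosen only for illustration.
a = [1.0, 0.5]
b = [2.0, 1.0]
mu = [1.0, 0.25]
print(l2_inner(a, b), rkhs_inner(a, b, mu))
```

Dividing by the eigenvalues penalizes mass on directions with small $\mu_\ell$, which is why the Hilbert norm acts as a smoothness measure while the $L^2(\mathbb{Q})$ norm does not.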


2.2  Sparse  additive  models  over  RKHS 

For each $j = 1, \ldots, d$, let $\mathcal{H}_j \subset L^2(\mathbb{Q})$ be a reproducing kernel Hilbert space of univariate functions
on the domain $\mathcal{X}$. Without loss of generality (by re-centering the functions as needed), we may
assume that

$$\mathbb{E}[f_j(x)] = \int_{\mathcal{X}} f_j(x)\, d\mathbb{Q}(x) = 0 \quad \text{for all } f_j \in \mathcal{H}_j,$$

and for each $j = 1, 2, \ldots, d$. For a given subset $S \subset \{1, 2, \ldots, d\}$, we define

$$\mathcal{H}(S) := \Big\{ f = \sum_{j \in S} f_j \;\Big|\; f_j \in \mathcal{H}_j, \text{ and } \|f_j\|_{\mathcal{H}_j} \le 1 \ \ \forall j \in S \Big\}, \qquad (4)$$

corresponding to the class of functions $f : \mathcal{X}^d \to \mathbb{R}$ that decompose as sums of univariate functions
on co-ordinates lying within the set $S$. Note that $\mathcal{H}(S)$ is also (a subset of) a reproducing kernel
Hilbert space, in particular with the norm

$$\|f\|_{\mathcal{H}(S)}^2 = \sum_{j \in S} \|f_j\|_{\mathcal{H}_j}^2,$$

where $\|\cdot\|_{\mathcal{H}_j}$ denotes the norm on the univariate Hilbert space $\mathcal{H}_j$. Finally, for a cardinality
$s \in \{1, 2, \ldots, \lfloor d/2 \rfloor\}$, we define the function class

$$\mathcal{F}_{d,s,\mathcal{H}} := \bigcup_{\substack{S \subset \{1, 2, \ldots, d\} \\ |S| = s}} \mathcal{H}(S). \qquad (5)$$


To ease notation, we frequently adopt the shorthand $\mathcal{F} = \mathcal{F}_{d,s,\mathcal{H}}$, but the reader should recall that
$\mathcal{F}$ depends on the choice of Hilbert spaces $\{\mathcal{H}_j\}_{j=1}^{d}$, and moreover, that we are actually studying a
sequence of function classes indexed by $(d, s)$.

Now let $\mathbb{P} = \mathbb{Q}^d$ denote the product measure on the space $\mathcal{X}^d \subset \mathbb{R}^d$. Given an arbitrary $f^* \in \mathcal{F}$,
we consider the observation model

$$y_i = f^*(x_i) + w_i, \quad \text{for } i = 1, 2, \ldots, n, \qquad (6)$$

where $\{w_i\}_{i=1}^{n}$ is an i.i.d. sequence of standard normal variates, and $\{x_i\}_{i=1}^{n}$ is a sequence of design
points in $\mathbb{R}^d$, sampled in an i.i.d. manner from $\mathbb{P}$.

Given an estimate $\hat{f}$, our goal is to bound the error $\hat{f} - f^*$ under two norms. The first is the usual
$L^2(\mathbb{P})$ norm on the space $\mathcal{F}$; given the product structure of $\mathbb{P}$ and the additive nature of any $f \in \mathcal{F}$,
it has the additive decomposition $\|f\|_{L^2(\mathbb{P})}^2 = \sum_{j=1}^{d} \|f_j\|_{L^2(\mathbb{Q})}^2$. In addition, we consider the error in
the empirical $L^2(\mathbb{P}_n)$-norm defined by the sample $\{x_i\}_{i=1}^{n}$, namely $\|f\|_{L^2(\mathbb{P}_n)}^2 := \frac{1}{n} \sum_{i=1}^{n} f^2(x_i)$.
Unlike the $L^2(\mathbb{P})$ norm, this norm does not decouple across the dimensions, but part of our analysis
will establish an approximate form of such decoupling. For shorthand, we frequently use the
notation $\|f\|_2 = \|f\|_{L^2(\mathbb{P})}$ and $\|f\|_n = \|f\|_{L^2(\mathbb{P}_n)}$ for a $d$-variate function $f \in \mathcal{F}$. With a minor
abuse of notation, for a univariate function $f_j \in \mathcal{H}_j$, we also use the shorthands $\|f_j\|_2 = \|f_j\|_{L^2(\mathbb{Q})}$
and $\|f_j\|_n = \|f_j\|_{L^2(\mathbb{Q}_n)}$.

3  Main  results  and  their  consequences 

This  section  is  devoted  to  the  statement  of  our  main  results,  and  discussion  of  some  of  their 
consequences.  We  begin  in  Section  3.1  by  describing  a  regularized  M-estimator  for  sparse  additive 
models,  and  we  state  our  convergence  results  for  this  estimator  in  Section  3.2.  We  illustrate  its 
convergence  rates  for  various  concrete  instances  of  kernel  classes.  In  Section  3.3,  we  provide  a 
detailed comparison of our results with past and concurrent work, including discussion of the
effect of global boundedness conditions on optimal rates. Finally, in Section 3.4, we state minimax
lower bounds on the $L^2(\mathbb{P})$ error, which establish the optimality of our procedure.

3.1 A regularized M-estimator for sparse additive models

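For any function of the form $f = \sum_{j=1}^{d} f_j$, the $(L^2(\mathbb{Q}_n), 1)$ and $(\mathcal{H}, 1)$-norms are given by

$$\|f\|_{n,1} := \sum_{j=1}^{d} \|f_j\|_n, \quad \text{and} \quad \|f\|_{\mathcal{H},1} := \sum_{j=1}^{d} \|f_j\|_{\mathcal{H}}, \qquad (7)$$

respectively. Using this notation, we define the cost functional

$$\mathcal{L}(f) = \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2 + \lambda_n \|f\|_{n,1} + \rho_n \|f\|_{\mathcal{H},1}. \qquad (8)$$

The cost functional $\mathcal{L}(f)$ is least-squares loss with a sparsity penalty $\|f\|_{n,1}$ and a smoothness
penalty $\|f\|_{\mathcal{H},1}$. Here $(\lambda_n, \rho_n)$ are a pair of positive regularization parameters whose choice will be
specified by our theory. Given this cost functional, we then consider the M-estimator

$$\hat{f} \in \arg\min_{f} \mathcal{L}(f) \quad \text{subject to } f = \sum_{j=1}^{d} f_j \text{ and } \|f_j\|_{\mathcal{H}} \le 1 \text{ for all } j = 1, 2, \ldots, d. \qquad (9)$$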

As  stated,  the  problem  (9)  is  infinite-dimensional  in  nature,  since  it  involves  optimization  over 
Hilbert  spaces.  However,  an  attractive  feature  of  this  M-estimator  is  that,  as  a  straightforward 
consequence  of  the  representer  theorem  [16,  30],  it  can  be  reduced  to  an  equivalent  convex  program 
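in $\mathbb{R}^n \times \mathbb{R}^d$. In particular, for each $j = 1, 2, \ldots, d$, let $K^j$ denote the kernel function for co-ordinate
$j$. Using the notation $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ for the $i$th sample, we define the collection of empirical
kernel matrices $K^j \in \mathbb{R}^{n \times n}$, with entries $K^j_{i\ell} = K^j(x_{ij}, x_{\ell j})$. By the representer theorem, any
solution $\hat{f}$ to the variational problem (9) can be written in the form

$$\hat{f}(z_1, \ldots, z_d) = \sum_{i=1}^{n} \sum_{j=1}^{d} \hat{\alpha}_{ij}\, K^j(z_j, x_{ij}),$$

for a collection of weights $\{\hat{\alpha}_j \in \mathbb{R}^n,\ j = 1, \ldots, d\}$. The optimal weights are obtained by solving
the convex program

$$(\hat{\alpha}_1, \ldots, \hat{\alpha}_d) = \arg\min_{\substack{\alpha_j \in \mathbb{R}^n \\ \alpha_j^T K^j \alpha_j \le 1}} \Big\{ \frac{1}{2n} \Big\| y - \sum_{j=1}^{d} K^j \alpha_j \Big\|_2^2 + \lambda_n \sum_{j=1}^{d} \sqrt{\frac{1}{n} \|K^j \alpha_j\|_2^2} + \rho_n \sum_{j=1}^{d} \sqrt{\alpha_j^T K^j \alpha_j} \Big\}. \qquad (10)$$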


This  problem  is  a  second-order  cone  program  (SOCP),  and  there  are  various  algorithms  for  solving 
it to arbitrary accuracy in time polynomial in $(n, d)$, among them interior point methods (e.g., see
§11  in  the  book  [6]). 
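To make the structure of the finite-dimensional program concrete, the following sketch evaluates its objective for a given set of weight vectors. It does not solve the SOCP. It assumes the loss is scaled by $1/(2n)$, that the empirical norm of the $j$th component is $\sqrt{\|K^j \alpha_j\|_2^2 / n}$, and that its Hilbert norm is $\sqrt{\alpha_j^T K^j \alpha_j}$; the Gaussian kernel and all function names are hypothetical choices made for illustration.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """A bounded univariate kernel (an illustrative choice, not the paper's)."""
    return math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))


def kernel_matrix(column, kernel=gaussian_kernel):
    """Empirical n x n kernel matrix for one co-ordinate of the design."""
    return [[kernel(a, b) for b in column] for a in column]


def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def objective(y, kernel_mats, alphas, lam, rho):
    """Least-squares loss plus the sparsity and smoothness penalties."""
    n = len(y)
    fitted = [0.0] * n
    sparsity = 0.0
    smoothness = 0.0
    for K, alpha in zip(kernel_mats, alphas):
        K_alpha = matvec(K, alpha)
        fitted = [f + v for f, v in zip(fitted, K_alpha)]
        # Empirical norm of the j-th component function.
        sparsity += math.sqrt(sum(v * v for v in K_alpha) / n)
        # Hilbert norm of the j-th component function (clipped at zero
        # to guard against tiny negative values from rounding).
        quad = sum(a * v for a, v in zip(alpha, K_alpha))
        smoothness += math.sqrt(max(quad, 0.0))
    loss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (2.0 * n)
    return loss + lam * sparsity + rho * smoothness
```

In practice one would hand the same three terms, together with the constraints $\alpha_j^T K^j \alpha_j \le 1$, to a conic solver; the evaluation above is only meant to show how each term is computed from the kernel matrices.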


Various combinations of sparsity and smoothness penalties, all slightly different from the
approach proposed here, have been used in past work on sparse additive models. For instance,
the method of Ravikumar et al. [28] is based on least-squares loss regularized with a single sparsity
constraint, and separate smoothness constraints for each univariate function. They solve the resulting
optimization problem using a back-fitting procedure. Koltchinskii and Yuan [17] develop a method
based on least-squares loss combined with a single penalty term $\sum_{j=1}^{d} \|f_j\|_{\mathcal{H}}$. Their method also
leads to an SOCP if $\mathcal{H}$ is a reproducing kernel Hilbert space, but differs from the program (10) in
lacking the additional sparsity penalties. Meier et al. [23] analyzed least-squares regularized with a
penalty term of the form $\sum_{j=1}^{d} \sqrt{\lambda_1 \|f_j\|_n^2 + \lambda_2 \|f_j\|_{\mathcal{H}}^2}$, where $\lambda_1$ and $\lambda_2$ are a pair of regularization
parameters. In their method, $\lambda_1$ controls the sparsity while $\lambda_2$ controls the smoothness. If $\mathcal{H}$ is an
RKHS, the method in Meier et al. [23] reduces to an ordinary group Lasso problem on a different
set of variables, which is another type of SOCP.


3.2  Convergence  rates 

We  now  state  a  result  that  provides  convergence  rates  for  the  estimator  (9),  or  equivalently  (10). 
To simplify presentation, we state our result in the special case that the univariate Hilbert spaces
$\mathcal{H}_j$, $j = 1, \ldots, d$, are all identical, denoted by $\mathcal{H}$. The analysis and results extend in a straightforward
manner  to  the  general  setting  of  distinct  univariate  Hilbert  spaces,  as  we  discuss  following  the 
statement  of  Theorem  1. 

Let $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ denote the non-negative eigenvalues of the kernel operator defining the
univariate Hilbert space $\mathcal{H}$, as defined in equation (3), and define the function

$$\mathcal{R}_n(t) := \frac{1}{\sqrt{n}} \Big[ \sum_{\ell=1}^{\infty} \min\{t^2, \mu_\ell\} \Big]^{1/2}. \qquad (11)$$

For a constant $\kappa_0 > 0$ to be chosen, let $\nu_n > 0$ be the smallest positive solution to the inequality

$$\nu_n^2 \ge \kappa_0\, \mathcal{R}_n(\nu_n). \qquad (12)$$

We refer to $\nu_n$ as the critical univariate rate, as it is the minimax-optimal rate for $L^2(\mathbb{P})$-estimation
of a single univariate function in the Hilbert space $\mathcal{H}$ (e.g., [24, 32]). This quantity will be referred
to throughout the remainder of the paper.
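The critical univariate rate can be computed numerically once the eigenvalues are known. The sketch below is a minimal illustration under stated assumptions: the critical inequality is read as $\nu_n^2 \ge \kappa_0 \mathcal{R}_n(\nu_n)$, the eigenvalue sequence is truncated to a finite list, and $\kappa_0 = 1$ is an arbitrary default. Bisection is valid here because $t \mapsto \mathcal{R}_n(t)/t$ is non-increasing, so the set of positive solutions is an interval $[\nu_n, \infty)$.

```python
import math


def r_n(t, eigenvalues, n):
    """R_n(t) = sqrt( sum_l min(t^2, mu_l) / n ) for a finite eigenvalue list."""
    return math.sqrt(sum(min(t * t, mu) for mu in eigenvalues) / n)


def critical_rate(eigenvalues, n, kappa0=1.0, tol=1e-12):
    """Smallest t > 0 with t^2 >= kappa0 * R_n(t), found by bisection."""
    lo, hi = 0.0, 1.0
    # Grow the upper end until it satisfies the inequality.
    while hi * hi < kappa0 * r_n(hi, eigenvalues, n):
        hi *= 2.0
    # Shrink the bracket [lo, hi] around the boundary of the solution set.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * mid >= kappa0 * r_n(mid, eigenvalues, n):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical example: a rank-4 kernel with unit eigenvalues and n = 100.
nu = critical_rate([1.0] * 4, n=100)
print(nu, nu ** 2)
```

For the rank-4 example, $\mathcal{R}_n(t) = t\sqrt{4/100}$ for $t \le 1$, so the solver returns a value close to $0.2$, and $\nu_n^2$ is close to $m/n = 0.04$.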

Our choices of regularization parameters are specified in terms of the quantity

$$\gamma_n := \kappa_1 \max\Big\{ \nu_n, \sqrt{\frac{\log d}{n}} \Big\}, \qquad (13)$$

where $\kappa_1 > 0$ is a sufficiently large constant, independent of the sample size and function classes.
We assume that each function within the unit ball of the univariate Hilbert space is bounded; that
is, for each $j = 1, \ldots, d$,

$$\|f_j\|_\infty \le 1 \quad \text{for all } \|f_j\|_{\mathcal{H}} \le 1. \qquad (14)$$

This condition is fairly mild, and is implied by having a bounded univariate kernel function, for
instance. These types of boundedness conditions are quite standard for proving upper bounds on
rates of convergence for non-parametric least squares in the univariate case $d = 1$ (see e.g. [31, 32]).
However, note that we do not assume that the functions $f = \sum_{j \in S} f_j$ in $\mathcal{F}$ are uniformly bounded
independently of $(d, s)$.


The following result applies to any class $\mathcal{F}_{d,s,\mathcal{H}}$ of sparse additive models based on the univariate
Hilbert space satisfying condition (14), and to the estimator (9) based on $n$ i.i.d. samples $(x_i, y_i)_{i=1}^{n}$
from the observation model (6).

Theorem 1. Let $\hat{f}$ be any minimizer of the convex program (9) with regularization parameters
$\lambda_n = c_3 \gamma_n$ and $\rho_n = c_4 \gamma_n^2$ for sufficiently large constants $c_3$ and $c_4$. Then provided that
$n \gamma_n^2 = \Omega(\log(1/\gamma_n))$, there are universal constants $(C, c_1, c_2)$ such that

$$\mathbb{P}\Big[ \|\hat{f} - f^*\|_n^2 \ge C \Big\{ \frac{s \log d}{n} + s \nu_n^2 \Big\} \Big] \le c_1 \exp(-c_2\, n \gamma_n^2). \qquad (15)$$


Remarks: (a) The technical condition $n \gamma_n^2 = \Omega(\log(1/\gamma_n))$ is quite mild, and satisfied in most
cases of interest, among them the kernels considered below in Corollaries 1 and 2.

(b) Although Theorem 1 is stated for the empirical $L^2(\mathbb{P}_n)$ error, the same bound holds for the
error $\|\Pi_{\mathcal{F}}(\hat{f}) - f^*\|_2^2$, where $\Pi_{\mathcal{F}}(\hat{f})$ is the projection of $\hat{f}$ onto the class $\mathcal{F}$ under the $L^2(\mathbb{P}_n)$-norm.
Although we suspect that the error $\|\hat{f} - f^*\|_2^2$ satisfies the same bound (15), our current techniques
only allow us to control the projected function $\Pi_{\mathcal{F}}(\hat{f})$. If we imposed a global boundedness
condition, then it would follow that $\|\hat{f} - f^*\|_2$ has the same scaling as $\|\hat{f} - f^*\|_n$ under the given
conditions. It remains an open question whether one can directly establish such a bound without a global
boundedness condition.


(c) For clarity, we have stated our result in the case where the univariate Hilbert space $\mathcal{H}$ is identical
across all co-ordinates. However, our proof extends with only notational changes to the general
setting, in which each co-ordinate $j$ is endowed with a (possibly distinct) Hilbert space $\mathcal{H}_j$. In this
case, the M-estimator returns a function $\hat{f}$ such that (with high probability)

$$\|\hat{f} - f^*\|_n^2 \le C \Big\{ \frac{s \log d}{n} + \sum_{j \in S} \nu_{n,j}^2 \Big\},$$

where $\nu_{n,j}$ is the critical univariate rate associated with the Hilbert space $\mathcal{H}_j$, and $S$ is the subset
on which $f^*$ is supported.


(d) As described in the introduction, the rate $\frac{s \log d}{n} + s \nu_n^2$ may be interpreted as the sum of a subset
selection term ($\frac{s \log d}{n}$) and an $s$-dimensional estimation term ($s \nu_n^2$). Note that the subset selection
term ($\frac{s \log d}{n}$) is independent of the choice of Hilbert space $\mathcal{H}$, whereas the $s$-dimensional estimation
term is independent of the ambient dimension $d$. Depending on the scaling of the triple $(n, d, s)$ and
the smoothness of the univariate RKHS $\mathcal{H}$, either the subset selection term or function estimation
term may dominate. In general, if $\frac{\log d}{n} = o(\nu_n^2)$, the $s$-dimensional estimation term dominates, and
vice versa otherwise. At the boundary, the scalings of the two terms are equivalent.
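The trade-off between the two terms is easy to tabulate. The sketch below compares them for given $(n, d, s)$ and a supplied value of $\nu_n^2$, ignoring all constants; the function names are hypothetical, and the example plugs in the order $n^{-2\alpha/(2\alpha+1)}$ with $\alpha = 1$ purely as an illustration.

```python
import math


def rate_terms(n, d, s, nu_sq):
    """Return (subset selection term, s-dimensional estimation term)."""
    subset_selection = s * math.log(d) / n
    estimation = s * nu_sq
    return subset_selection, estimation


def dominant_term(n, d, s, nu_sq):
    """Name the larger of the two terms, ignoring constant factors."""
    subset_selection, estimation = rate_terms(n, d, s, nu_sq)
    if subset_selection > estimation:
        return "subset selection"
    if estimation > subset_selection:
        return "estimation"
    return "balanced"


# With n = 1000 and alpha = 1, nu_n^2 is of order 1000 ** (-2/3), about 0.01.
nu_sq = 1000 ** (-2.0 / 3.0)
print(dominant_term(1000, 100, 5, nu_sq))      # log(d)/n is about 0.0046
print(dominant_term(1000, 10 ** 9, 5, nu_sq))  # log(d)/n is about 0.0207
```

Since $s$ multiplies both terms, which term dominates depends only on how $\frac{\log d}{n}$ compares with $\nu_n^2$, as stated in the remark.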


Theorem  1  has  a  number  of  corollaries,  obtained  by  specifying  particular  choices  of  kernels. 
First,  we  discuss  m-rank  operators,  meaning  that  the  kernel  function  K  can  be  expanded  in  terms 
of  m  eigenfunctions.  This  class  includes  linear  functions,  polynomial  functions,  as  well  as  any 
function  class  based  on  finite  dictionary  expansions. 


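Corollary 1. Under the same conditions as Theorem 1, consider a univariate kernel with finite
rank $m$. Then any solution $\hat{f}$ to the problem (9) satisfies

$$\mathbb{P}\Big[ \max\big\{ \|\hat{f} - f^*\|_n^2,\ \|\Pi_{\mathcal{F}}(\hat{f}) - f^*\|_2^2 \big\} \ge C \Big\{ \frac{s \log d}{n} + s\, \frac{m}{n} \Big\} \Big] \le c_1 \exp\big( -c_2 (m + \log d) \big). \qquad (16)$$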


Proof. It suffices to show that the critical univariate rate (12) satisfies the scaling $\nu_n^2 = O(m/n)$.
For a finite-rank kernel and any $t > 0$, we have

$$\mathcal{R}_n(t) = \frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^{m} \min\{t^2, \mu_j\}} \le t \sqrt{\frac{m}{n}},$$

from which the claim follows by the definition (12). □
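The final step of this proof can be spelled out as follows, taking the critical inequality (12) in the form $\nu_n^2 \ge \kappa_0 \mathcal{R}_n(\nu_n)$ with $\mathcal{R}_n$ as in (11). Each of the $m$ summands is at most $t^2$, so any $t \ge \kappa_0 \sqrt{m/n}$ satisfies the inequality, and the smallest solution $\nu_n$ can be no larger than that value:

```latex
\begin{align*}
\mathcal{R}_n(t)
  &= \frac{1}{\sqrt{n}} \Big[ \sum_{j=1}^{m} \min\{t^2, \mu_j\} \Big]^{1/2}
   \le \frac{1}{\sqrt{n}} \big[ m\, t^2 \big]^{1/2}
   = t \sqrt{\frac{m}{n}}, \\
t \ge \kappa_0 \sqrt{\frac{m}{n}}
  &\implies t^2 \ge \kappa_0\, t \sqrt{\frac{m}{n}} \ge \kappa_0\, \mathcal{R}_n(t), \\
\nu_n \le \kappa_0 \sqrt{\frac{m}{n}}
  &\implies \nu_n^2 \le \kappa_0^2\, \frac{m}{n} = O\Big( \frac{m}{n} \Big).
\end{align*}
```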


Next, we present a result for RKHSs with infinitely many eigenvalues, but whose eigenvalues
decay at a rate $\mu_\ell \simeq (1/\ell)^{2\alpha}$ for some parameter $\alpha > 1/2$. Among other examples, this type of
scaling covers the case of Sobolev spaces, say consisting of functions with $\alpha$ derivatives (e.g., [5, 13]).

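Corollary 2. Under the same conditions as Theorem 1, consider a univariate kernel with eigenvalue
decay $\mu_\ell \simeq (1/\ell)^{2\alpha}$ for some $\alpha > 1/2$. Then the kernel estimator defined in (9) satisfies

$$\mathbb{P}\Big[ \max\big\{ \|\hat{f} - f^*\|_n^2,\ \|\Pi_{\mathcal{F}}(\hat{f}) - f^*\|_2^2 \big\} \ge C \Big\{ \frac{s \log d}{n} + s \Big( \frac{1}{n} \Big)^{\frac{2\alpha}{2\alpha+1}} \Big\} \Big] \le c_1 \exp\big( -c_2 ( n^{\frac{1}{2\alpha+1}} + \log d ) \big). \qquad (17)$$

Proof. As in the previous corollary, we need to compute the critical univariate rate $\nu_n$. Given the
assumption of polynomial eigenvalue decay, a truncation argument shows that $\mathcal{R}_n(t) = O\big( t^{1 - \frac{1}{2\alpha}} / \sqrt{n} \big)$.
Consequently, the critical univariate rate (12) satisfies the scaling $\nu_n^2 \asymp \nu_n^{1 - \frac{1}{2\alpha}} / \sqrt{n}$, or equivalently,
$\nu_n^2 \asymp n^{-\frac{2\alpha}{2\alpha+1}}$. □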

3.3  Comparison  with  other  work 

It is interesting to compare these convergence rates in $L^2(\mathbb{P}_n)$ error with those established in past
work [17, 23, 28]. Ravikumar et al. [28] show that any solution to their back-fitting method is
consistent in terms of mean-squared error risk (see Theorem 3 in their paper). However, their analysis
does not appear to track $s$ explicitly, and assumes that $d$ is sufficiently large to ensure that the
subset selection term dominates, so the result is not directly comparable. The method of Koltchinskii
and Yuan [17] is based on regularizing the least-squares loss with the $(\mathcal{H}, 1)$-norm penalty, that
is, $\sum_{j=1}^{d} \|f_j\|_{\mathcal{H}}$; Theorem 2 in their paper presents a rate that captures the decomposition into two
terms, which can be interpreted as a subset selection term and an $s$-dimensional estimation term.
In quantitative terms, however, their rates are looser than those given here; in particular, their
bound includes a term of the order $\frac{s^3 \log d}{n}$, which is larger than the bound in Theorem 1. For their
algorithm, Meier et al. [23] establish a convergence rate of the form $O\big( s \big( \frac{\log d}{n} \big)^{\frac{2\alpha}{2\alpha+1}} \big)$ in the case of
$\alpha$-smooth Sobolev spaces (see Theorem 1 in their paper). This result is sub-optimal compared to
the optimal rate proven in Theorem 2(b) in regimes when $d$ is large.$^1$ In all of the above-mentioned
methods, it is unclear whether or not sharper analysis would yield better rates.

$^1$ More precisely, we either have $\frac{\log d}{n} < \big( \frac{\log d}{n} \big)^{\frac{2\alpha}{2\alpha+1}}$, when the subset selection term dominates, or
$\big( \frac{1}{n} \big)^{\frac{2\alpha}{2\alpha+1}} < \big( \frac{\log d}{n} \big)^{\frac{2\alpha}{2\alpha+1}}$, when the $s$-dimensional estimation term dominates.

Finally, as discussed previously in the introduction, the concurrent work of Koltchinskii and
Yuan [18] analyzes a method that is essentially the same as our M-estimator (9). In terms of rates
obtained, they establish a convergence rate based on two terms as in Theorem 1, but with a
pre-factor that depends on the global bound $C(\mathcal{F}_{d,s}) = \sup_{f \in \mathcal{F}_{d,s}} \|f\|_\infty$. (Recall that functions $f \in \mathcal{F}_{d,s}$
consist of sums of the form $f = \sum_{j \in S} f_j$, where $S$ has cardinality $s$.) In contrast, our pre-factor
contains no dependence on this global quantity. Thus, if one assumes that $C(\mathcal{F}_{d,s}) = O(1)$ even
as $(d, s)$ scale, then the rates obtained are the same up to constant factors. However, making such
an assumption in the high-dimensional setting can be quite restrictive. Indeed, as shown by the
following example, it can lead to quite "small" function classes $\mathcal{F}_{d,s}$ for which much faster rates
can be achieved using different methods.

Example 1 (Restrictiveness of assuming global boundedness). Suppose that each covariate $x_j$ is
uniform on $[-1, +1]$, and consider the class of univariate linear functions

$$\mathcal{H} := \{ g_\beta : \mathbb{R} \to \mathbb{R} \mid \beta \in \mathbb{R} \}, \quad \text{where } g_\beta(x_j) = \beta x_j.$$

Thus, our function class $\mathcal{F} = \mathcal{F}_{d,s}$ consists of sparse linear functions of the form

$$f_\beta(x) = \sum_{j \in S} g_{\beta_j}(x_j) = \sum_{j \in S} \beta_j x_j.$$

Since $\|g_{\beta_j}\|_\infty = |\beta_j|$, boundedness of the univariate classes amounts to the requirement $|\beta_j| \le 1$.
Moreover, for any function $f_\beta \in \mathcal{F}$, note that we have $\|f_\beta\|_\infty = \|\beta\|_1$. Consequently,
imposing the global boundedness condition $C(\mathcal{F}) = \sup_{f_\beta \in \mathcal{F}} \|f_\beta\|_\infty \le R$ is equivalent to the
constraint $\|\beta\|_1 \le R$, so that the problem reduces to ordinary linear regression over the $\ell_1$-ball
$\mathbb{B}_1(R) = \{ \beta \in \mathbb{R}^d \mid \|\beta\|_1 \le R \}$. For this problem, it is known [8, 27] that the Lasso will produce an
estimate such that

$$\|f_{\hat{\beta}} - f_{\beta^*}\|_n^2 \lesssim R \sqrt{\frac{\log d}{n}} \qquad (18)$$

with  high  probability.2  This  rate  is  independent  of  s  because  the  global  boundedness  condition 
restricts  us  to  a  f’l-ball  with  constant  radius  R ;  indeed,  the  rate  (18)  is  minimax-optimal  over  the 
set  B>i(R)  (e.g.,  see  the  paper  [27]).  In  contrast,  the  error  bound 


-nl-  +  — 

n  n 


(19) 


does  depend  on  s  and  so  can  be  substantially  weaker,  depending  on  the  choice  of  s.  For  exam¬ 
ple,  taking  s  =  \\Zd~],  the  optimal  rate  (18)  scales  logarithmically  in  d  whereas  the  scaling  of  the 
sub-optimal  rate  (19)  is  exponentially  larger.  This  construction  shows  that  the  rates  derived  under 
global  boundedness  conditions  are  not  minimax-optimal  in  general. 
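
The gap between the two scalings in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it drops all constants, takes the rate (18) to be of order $R\sqrt{\log d / n}$, and assumes the bound (19) is of order $s \log d / n + s / n$ (the form of Theorem 1 for a class whose univariate rate is $1/n$). The function names are hypothetical.

```python
import math


def l1_ball_rate(d: int, n: int, radius: float = 1.0) -> float:
    """Order of the l1-ball rate (18): R * sqrt(log(d) / n), constants dropped."""
    return radius * math.sqrt(math.log(d) / n)


def sparsity_rate(d: int, n: int, s: int) -> float:
    """Assumed order of the bound (19): s * log(d) / n + s / n, constants dropped."""
    return s * math.log(d) / n + s / n


def rate_ratio(d: int, n: int, radius: float = 1.0) -> float:
    """Ratio (19) / (18) with s = ceil(sqrt(d)); requires d >= 2."""
    s = math.isqrt(d - 1) + 1  # exact integer ceil(sqrt(d))
    return sparsity_rate(d, n, s) / l1_ball_rate(d, n, radius)


if __name__ == "__main__":
    for d in (10**2, 10**4, 10**6):
        print(d, round(rate_ratio(d, n=10_000), 3))
```

With $n$ fixed, the ratio grows roughly like $\sqrt{d \log d}$, whereas (18) itself grows only like $\sqrt{\log d}$; this is the sense in which (19) is exponentially larger in its dependence on $d$.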


Returning to the setting of a general RKHS $\mathcal{H}(S)$, a global boundedness condition imposes an
upper bound on the Hilbert norm radius of functions in
$\mathcal{F}_{s,d} = \cup_{|S|=s} \mathcal{H}(S)$. Indeed, for a given subset $S$, let $\rho > 0$ be
the largest radius such that $\{\|f\|_{\mathcal{H}(S)} \le \rho\} \subseteq \mathcal{F}$. Then for any
$x \in \mathcal{X}$, we have

$$\sup_{f \in \mathcal{F}} |f(x)| \;\ge\; \sup_{\|f\|_{\mathcal{H}(S)} \le \rho} |f(x)|
 \;=\; \sup_{\|f\|_{\mathcal{H}(S)} \le \rho} |\langle f, K(\cdot, x) \rangle_{\mathcal{H}(S)}|
 \;=\; \rho \, \|K(\cdot, x)\|_{\mathcal{H}(S)}
 \;=\; \rho \sqrt{K(x, x)},$$

where $K$ denotes the kernel associated with $\mathcal{H}(S)$. By definition of the Hilbert space
$\mathcal{H}(S)$, we have $K(x, x) = \sum_{j \in S} K(x_j, x_j)$, where $K(x_j, x_j)$ is the univariate
kernel over co-ordinate $j$. Consequently, we have the lower bound

$$\rho \sup_{x \in \mathcal{X}^{|S|}} \|K(\cdot, x)\|_{\mathcal{H}(S)}
 \;=\; \rho \sup_{x \in \mathcal{X}^{|S|}} \sqrt{\sum_{j \in S} K(x_j, x_j)}
 \;\ge\; \rho \sqrt{s} \sup_{x_1 \in \mathcal{X}} \sqrt{K(x_1, x_1)},$$

showing that we require the bound $\rho = \mathcal{O}(1/\sqrt{s})$ in order to ensure
$C(\mathcal{F}_{s,d}) = \mathcal{O}(1)$.

² Here we use $\precsim$ to denote inequality up to constant factors depending on variances of the
design and noise. This is the optimal rate for regression over $\ell_1$-balls, as opposed to
$\ell_0$-balls.


3.4  Minimax  lower  bounds 

In this section, we provide minimax lower bounds in $L^2(\mathbb{P})$ error so as to complement the
achievability results derived in Theorem 1. Given the function class $\mathcal{F}$, the minimax
$L^2(\mathbb{P})$-error is given by

$$\mathfrak{M}_P(\mathcal{F}) := \inf_{\hat f_n} \sup_{f^* \in \mathcal{F}} \|\hat f_n - f^*\|_2^2, \tag{20}$$

where the infimum is taken over all measurable functions of the $n$ samples $\{(y_i, x_i)\}_{i=1}^n$.
As defined, this minimax error is a random variable, and our goal is to obtain a lower bound in
probability.

Central to our proof of the lower bounds is the metric entropy structure of the univariate
reproducing kernel Hilbert spaces. More precisely, our lower bounds depend on the packing entropy,
defined as follows. Let $(\mathcal{G}, \rho)$ be a totally bounded metric space, consisting of a set
$\mathcal{G}$ and a metric $\rho : \mathcal{G} \times \mathcal{G} \to \mathbb{R}_+$. An
$\epsilon$-packing of $\mathcal{G}$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{G}$ such
that $\rho(f^i, f^j) \ge \epsilon$ for all $i \ne j$. The $\epsilon$-packing number
$M(\epsilon; \mathcal{G}, \rho)$ is the cardinality of the largest $\epsilon$-packing. The packing
entropy is simply the logarithm of the packing number, namely the quantity
$\log M(\epsilon; \mathcal{G}, \rho)$, to which we also refer as the metric entropy.
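
To make the definition concrete, here is a small sketch that builds an $\epsilon$-packing of a finite point set greedily. It is an illustration only: a greedy pass produces a maximal packing (no further point can be added), which need not be the largest one, so the logarithm of its size is merely a lower bound on the packing entropy. The function names are hypothetical.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def greedy_packing(points: Sequence[T], eps: float,
                   metric: Callable[[T, T], float]) -> list[T]:
    """Return a maximal eps-packing: all pairwise distances are >= eps."""
    packing: list[T] = []
    for p in points:
        # Keep p only if it is eps-separated from everything kept so far.
        if all(metric(p, q) >= eps for q in packing):
            packing.append(p)
    return packing


def packing_entropy_lower_bound(points: Sequence[T], eps: float,
                                metric: Callable[[T, T], float]) -> float:
    """log of the greedy packing size, a lower bound on log M(eps)."""
    return math.log(len(greedy_packing(points, eps, metric)))


if __name__ == "__main__":
    grid = list(range(11))  # the points 0, 1, ..., 10 on the real line
    print(greedy_packing(grid, 3, lambda a, b: abs(a - b)))
```

On the integer grid above with $\epsilon = 3$, the greedy pass keeps every third point.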

With this set-up, we derive explicit minimax lower bounds for two different scalings of the
univariate metric entropy.

Logarithmic metric entropy: There exists some $m > 0$ such that

$$\log M(\epsilon; \mathbb{B}_{\mathcal{H}}(1), \|\cdot\|_2) \simeq m \log(1/\epsilon) \quad \text{for all } \epsilon \in (0, 1). \tag{21}$$

Function classes with metric entropy of this type include linear functions (for which $m = k$),
univariate polynomials of degree $k$ (for which $m = k + 1$), and more generally, any function space
with finite VC-dimension [33]. This type of scaling also holds for any RKHS based on a kernel with
rank $m$ (e.g., see [10]), and these finite-rank kernels include both linear and polynomial functions
as special cases.

Polynomial metric entropy: There exists some $\alpha > 0$ such that

$$\log M(\epsilon; \mathbb{B}_{\mathcal{H}}(1), \|\cdot\|_2) \simeq (1/\epsilon)^{1/\alpha} \quad \text{for all } \epsilon \in (0, 1). \tag{22}$$

Various types of Sobolev/Besov classes exhibit this type of metric entropy decay [5, 13]. In fact,
any RKHS in which the kernel eigenvalues decay at a rate $j^{-2\alpha}$ has a metric entropy with this
scaling [9, 10].

We  are  now  equipped  to  state  our  lower  bounds  on  the  minimax  risk  (20): 

Theorem 2. Given $n$ i.i.d. samples from the sparse additive model (6) with sparsity $s \le d/4$,
there is a universal constant $C > 0$ such that:

(a) For a univariate class $\mathcal{H}$ with logarithmic metric entropy (21) indexed by parameter
$m$, we have

$$\mathfrak{M}_P(\mathcal{F}) \ge C \Big\{ \frac{s \log(d/s)}{n} + s \, \frac{m}{n} \Big\} \tag{23}$$

with probability greater than 1/2.

(b) For a univariate class $\mathcal{H}$ with polynomial metric entropy (22) indexed by $\alpha$, we
have

$$\mathfrak{M}_P(\mathcal{F}) \ge C \Big\{ \frac{s \log(d/s)}{n} + s \Big(\frac{1}{n}\Big)^{\frac{2\alpha}{2\alpha+1}} \Big\} \tag{24}$$

with probability greater than 1/2.

The choice of stating bounds that hold with probability 1/2 is simply a convention often used in
information-theoretic approaches (see, for instance, the papers [14, 35, 36]). We note that analogous
lower bounds can be established with probabilities arbitrarily close to one, albeit at the expense of
worse constants. The most important consequence of Theorem 2 is in establishing the
minimax-optimality of the results given in Corollaries 1 and 2; in particular, in the regime of
sub-linear sparsity (i.e., for which $\log d = \mathcal{O}(\log(d/s))$), the combination of Theorem 2
with these corollaries identifies the minimax rates up to constant factors.
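
As a rough illustration of how the two terms in Theorem 2 trade off, the sketch below evaluates them with the universal constant $C$ set to one. This is an assumption made purely for illustration, since the theorem does not specify $C$; the function names are hypothetical.

```python
import math


def selection_term(n: int, d: int, s: int) -> float:
    """Subset selection term s * log(d / s) / n shared by (23) and (24)."""
    if not 1 <= s <= d / 4:
        raise ValueError("Theorem 2 assumes sparsity 1 <= s <= d / 4")
    return s * math.log(d / s) / n


def estimation_term_log(n: int, s: int, m: float) -> float:
    """Estimation term s * m / n under logarithmic metric entropy (21)."""
    return s * m / n


def estimation_term_poly(n: int, s: int, alpha: float) -> float:
    """Estimation term s * n^(-2a / (2a + 1)) under polynomial metric entropy (22)."""
    return s * n ** (-2.0 * alpha / (2.0 * alpha + 1.0))


def dominant_term_poly(n: int, d: int, s: int, alpha: float) -> str:
    """Name the larger of the two terms appearing in the bound (24)."""
    sel = selection_term(n, d, s)
    est = estimation_term_poly(n, s, alpha)
    return "subset selection" if sel >= est else "estimation"


if __name__ == "__main__":
    print(dominant_term_poly(n=1000, d=10_000, s=10, alpha=2.0))
    print(dominant_term_poly(n=1000, d=100, s=10, alpha=2.0))
```

For fixed $(n, s, \alpha)$, increasing $d$ eventually makes the subset selection term the larger of the two.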


4  Proofs 

In this section, we provide the proofs of our main results, namely Theorems 1 and 2. For clarity
of presentation, we split the proofs up into a series of lemmas, with the bulk of the more technical
proofs deferred to the appendices. This splitting allows our presentation in the main text to be
relatively streamlined.

4.1 Proof of Theorem 1

At a high level, Theorem 1 is based on an appropriate adaptation to the non-parametric setting
of various techniques that have been developed for bounding the error in sparse linear regression
(e.g., [4, 26]). In contrast to the parametric setting (where classical tail bounds are sufficient),
controlling the error terms in this analysis requires more advanced techniques from empirical process
theory. In particular, we make use of concentration theorems for Gaussian and empirical processes
(e.g., [19, 22]) as well as results on the Rademacher complexity of kernel classes [24]. The proof is
based on four technical lemmas. First, Lemma 1 provides an upper bound on the Gaussian complexity of
any function of the form $f = \sum_{j=1}^d f_j$ in terms of the norms $\|\cdot\|_{\mathcal{H},1}$
and $\|\cdot\|_{n,1}$ previously defined. Lemma 2 exploits the notion of decomposability [26], as
applied to these norms, in order to show that the error function belongs to a particular cone-shaped
set. Finally, Lemmas 3 and 4 establish some relations between the $L^2(\mathbb{P})$ and
$L^2(\mathbb{P}_n)$ norms of functions in the class $\mathcal{F}$. The latter lemma involves a
truncation argument so as to avoid having to impose global bounds on the function class.

Throughout the proof, we use $C$ and $c_i$, $i = 1, 2, 3, 4$, to denote universal constants,
independent of $(n, d, s)$. Note that the precise numerical values of these constants may change from
line to line. We use $(\kappa_0, \kappa_1, \kappa_2, \kappa_3)$ to denote constants, independent of
$(n, d, s)$, but whose value is fixed throughout. To ease notation, we define

$$\delta_n^2 := \kappa_2 \Big\{ \frac{s \log d}{n} + s \nu_n^2 \Big\},$$

where the constant $\kappa_2 > 0$ is to be chosen. Recall the definitions of $\nu_n$ and $\gamma_n$
from equations (12) and (13) respectively, and note that $\delta_n = \Theta(\sqrt{s} \, \gamma_n)$.
For a subset $A \subseteq \{1, 2, \ldots, d\}$ and an additively decomposed function
$f = \sum_{j=1}^d f_j$, we adopt the convenient notation

$$\|f_A\|_{n,1} := \sum_{j \in A} \|f_j\|_n, \quad \text{and} \quad \|f_A\|_{\mathcal{H},1} := \sum_{j \in A} \|f_j\|_{\mathcal{H}}. \tag{25}$$


4.1.1  Establishing  a  basic  inequality 

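We begin by establishing a basic inequality on the error function $\hat\Delta := \hat f - f^*$. Since
$\hat f$ and $f^*$ are, respectively, optimal and feasible for the problem (9), we are guaranteed that
$\mathcal{L}(\hat f) \le \mathcal{L}(f^*)$, and hence that the error function $\hat\Delta$ satisfies
the bound

$$\frac{1}{2n} \sum_{i=1}^n \big(w_i - \hat\Delta(x_i)\big)^2 + \lambda_n \|\hat f\|_{n,1} + \rho_n \|\hat f\|_{\mathcal{H},1}
 \;\le\; \frac{1}{2n} \sum_{i=1}^n w_i^2 + \lambda_n \|f^*\|_{n,1} + \rho_n \|f^*\|_{\mathcal{H},1}.$$

Some simple algebra yields the bound

$$\frac{1}{2} \|\hat\Delta\|_n^2 \;\le\; \Big| \frac{1}{n} \sum_{i=1}^n w_i \hat\Delta(x_i) \Big| + \lambda_n \|\hat\Delta\|_{n,1} + \rho_n \|\hat\Delta\|_{\mathcal{H},1}, \tag{26}$$

which we refer to as our basic inequality [32].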
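
The algebra behind the basic inequality can be spelled out as follows. This is a sketch under the assumptions that the observations follow $y_i = f^*(x_i) + w_i$, so that the residual of $\hat f$ at $x_i$ equals $w_i - \hat\Delta(x_i)$ with $\hat\Delta = \hat f - f^*$, and that $\|g\|_n^2 = \frac{1}{n}\sum_{i=1}^n g^2(x_i)$. Expanding the square on the left-hand side of the optimality condition and applying the triangle inequality to each regularizer gives:

```latex
\begin{align*}
\frac{1}{2n}\sum_{i=1}^n \big(w_i - \hat\Delta(x_i)\big)^2 - \frac{1}{2n}\sum_{i=1}^n w_i^2
  &= \frac{1}{2}\|\hat\Delta\|_n^2 - \frac{1}{n}\sum_{i=1}^n w_i \hat\Delta(x_i), \\
\|f^*\|_{n,1} - \|\hat f\|_{n,1}
  &\le \|\hat f - f^*\|_{n,1} = \|\hat\Delta\|_{n,1}, \\
\|f^*\|_{\mathcal{H},1} - \|\hat f\|_{\mathcal{H},1}
  &\le \|\hat f - f^*\|_{\mathcal{H},1} = \|\hat\Delta\|_{\mathcal{H},1}.
\end{align*}
```

Moving the cross term to the right-hand side and bounding it by its absolute value then yields the basic inequality, with $\frac{1}{2}\|\hat\Delta\|_n^2$ on the left.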
4.1.2 Controlling the noise term

The following lemma provides control on the term on the right-hand side of inequality (26) by
simultaneously bounding the Gaussian complexity for the univariate functions $\hat\Delta_j$ in terms
of their $\|\cdot\|_n$ and $\|\cdot\|_{\mathcal{H}}$ norms. In particular, recalling that
$\gamma_n = \kappa_1 \max\{\sqrt{\frac{\log d}{n}}, \nu_n\}$, we have the following lemma.

Lemma 1. For a constant $\kappa_3 > 0$, define the event

$$\mathcal{T}(\gamma_n) := \Big\{ \forall \, j = 1, 2, \ldots, d, \quad \Big| \frac{1}{n} \sum_{i=1}^n w_i \hat\Delta_j(x_{ij}) \Big| \le 2 \kappa_3 \big\{ \gamma_n^2 \|\hat\Delta_j\|_{\mathcal{H}} + \gamma_n \|\hat\Delta_j\|_n \big\} \Big\}. \tag{27}$$

Then under the condition $n \gamma_n^2 = \Omega(\log(1/\gamma_n))$, we have

$$\mathbb{P}(\mathcal{T}(\gamma_n)) \ge 1 - c_1 \exp(-c_2 n \gamma_n^2). \tag{28}$$

The proof of this lemma, provided in Appendix A, uses concentration of measure for Lipschitz
functions over Gaussian random variables [19] combined with a peeling argument [1, 32]. The
subset selection term ($\frac{s \log d}{n}$) in Theorem 1 arises from taking the maximum over all $d$
components.

4.1.3  Exploiting  decomposability 

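The remainder of our analysis involves conditioning on the event $\mathcal{T}(\gamma_n)$. Using
Lemma 1, on the event $\mathcal{T}(\gamma_n)$ we have:

$$\frac{1}{2} \|\hat\Delta\|_n^2 \le 2 \kappa_3 \gamma_n \|\hat\Delta\|_{n,1} + 2 \kappa_3 \gamma_n^2 \|\hat\Delta\|_{\mathcal{H},1} + \lambda_n \|\hat\Delta\|_{n,1} + \rho_n \|\hat\Delta\|_{\mathcal{H},1}.$$

Recalling that $S$ denotes the true support of the unknown function $f^*$, note that we have
$\|\hat\Delta\|_{n,1} = \|\hat\Delta_S\|_{n,1} + \|\hat\Delta_{S^c}\|_{n,1}$, with a similar
decomposition for $\|\hat\Delta\|_{\mathcal{H},1}$. The next lemma shows that conditioned on
$\mathcal{T}(\gamma_n)$, the quantities $\|\hat\Delta\|_{\mathcal{H},1}$ and $\|\hat\Delta\|_{n,1}$
are not significantly larger than the corresponding norms as applied to the function $\hat\Delta_S$.

Lemma 2. Conditioned on $\mathcal{T}(\gamma_n)$, and with the choices
$\lambda_n \ge 4 \kappa_3 \gamma_n$ and $\rho_n \ge 4 \kappa_3 \gamma_n^2$, we have

$$\lambda_n \|\hat\Delta\|_{n,1} + \rho_n \|\hat\Delta\|_{\mathcal{H},1} \le 4 \lambda_n \|\hat\Delta_S\|_{n,1} + 4 \rho_n \|\hat\Delta_S\|_{\mathcal{H},1}. \tag{29}$$

The proof of this lemma, provided in Appendix B, is based on the decomposability [26] of the
$\|\cdot\|_{\mathcal{H},1}$ and $\|\cdot\|_{n,1}$ norms. This lemma allows us to exploit the sparsity
assumption, since in conjunction with Lemma 1, we have now bounded the right-hand side of the basic
inequality (26) in terms involving only $\hat\Delta_S$. In particular, still conditioning on
$\mathcal{T}(\gamma_n)$ and applying Lemma 2, we obtain

$$\|\hat\Delta\|_n^2 \le C \big\{ \gamma_n \|\hat\Delta_S\|_{n,1} + \gamma_n^2 \|\hat\Delta_S\|_{\mathcal{H},1} + \lambda_n \|\hat\Delta_S\|_{n,1} + \rho_n \|\hat\Delta_S\|_{\mathcal{H},1} \big\}
 \le C \big\{ \gamma_n \|\hat\Delta_S\|_{n,1} + \gamma_n^2 \|\hat\Delta_S\|_{\mathcal{H},1} \big\},$$

where³ we have recalled our choices $\lambda_n = \Theta(\gamma_n)$ and $\rho_n = \Theta(\gamma_n^2)$.
Finally, since both $\hat f_j$ and $f_j^*$ belong to $\mathbb{B}_{\mathcal{H}}(1)$, we have

$$\|\hat\Delta_j\|_{\mathcal{H}} \le \|\hat f_j\|_{\mathcal{H}} + \|f_j^*\|_{\mathcal{H}} \le 2,$$

which implies that $\|\hat\Delta_S\|_{\mathcal{H},1} \le 2s$, and hence

$$\|\hat\Delta\|_n^2 \le C \big\{ \gamma_n \|\hat\Delta_S\|_{n,1} + s \gamma_n^2 \big\}. \tag{30}$$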

³ In this step and elsewhere, the reader should be reminded of our convention that the numerical
value of $C$ can change from line to line.

4.1.4 Relating the $L^2(\mathbb{P}_n)$ and $L^2(\mathbb{P})$ norms

It remains to control the term $\|\hat\Delta_S\|_{n,1} = \sum_{j \in S} \|\hat\Delta_j\|_n$. Ideally,
we would like to upper bound it by $\sqrt{s} \, \|\hat\Delta_S\|_n$. Such an upper bound would follow
immediately if it were phrased in terms of the $\|\cdot\|_2$ rather than the $\|\cdot\|_n$ norm, but
there are additional cross-terms with the empirical norm. Accordingly, we make use of two lemmas that
relate the empirical $\|\cdot\|_n$ norm and the population $\|\cdot\|_2$ norm for functions in
$\mathcal{F}$.

In the statements of these results, we adopt the notation $g$ and $g_j$ (as opposed to $f$ and $f_j$)
to make clear that our results apply to any $g \in \mathcal{F}$. We first provide an upper bound on
the empirical norm $\|g_j\|_n$ in terms of the associated $\|g_j\|_2$ norm, one that holds uniformly
over all components $j = 1, 2, \ldots, d$.

Lemma 3. For a universal constant $C$ and $j = 1, 2, \ldots, d$, consider the events

$$\mathcal{A}_j(\gamma_n) := \big\{ \|g_j\|_n \le 4 \|g_j\|_2 + C \gamma_n \ \text{ for all } g_j \in \mathbb{B}_{\mathcal{H}}(2) \big\}, \tag{31}$$

as well as $\mathcal{A}(\gamma_n) = \cap_{j=1}^d \mathcal{A}_j(\gamma_n)$. If the univariate Hilbert
space $\mathcal{H}$ satisfies condition (14), then there are universal constants $(c_1, c_2)$ such
that

$$\mathbb{P}[\mathcal{A}(\gamma_n)] \ge 1 - c_1 \exp(-c_2 n \gamma_n^2).$$

We now define the function class $2\mathcal{F} := \{f + f' \mid f, f' \in \mathcal{F}\}$. Our second
lemma guarantees that the empirical norm $\|\cdot\|_n$ of any function in $2\mathcal{F}$ is uniformly
lower bounded by the norm $\|\cdot\|_2$.

Lemma 4. Define the event

$$\mathcal{B}(\delta_n) := \big\{ \|g\|_n^2 \ge \|g\|_2^2 / 4 \ \text{ for all } g \in 2\mathcal{F} \text{ with } \|g\|_2 \ge \delta_n \big\}. \tag{32}$$

If the underlying univariate Hilbert space $\mathcal{H}$ satisfies condition (14), then there are
universal constants $(c_1, c_2)$ such that

$$\mathbb{P}[\mathcal{B}(\delta_n)] \ge 1 - c_1 \exp(-c_2 n \delta_n^2).$$

Lemmas 3 and 4 are proved in Appendices F and D, respectively. Note that while both results
require bounds on the univariate function classes (recall condition (14)), they do not require global
boundedness assumptions, that is, bounds on quantities of the form $\|\sum_{j \in S} g_j\|_\infty$.
Typically, we expect that the $\|\cdot\|_\infty$-norms of functions $g \in \mathcal{F}$ scale with $s$.

4.1.5  Completing  the  proof 

Using Lemmas 3 and 4, we can complete the proof of Theorem 1. For the remainder of the proof,
let us condition on the events $\mathcal{A}(\gamma_n)$ and $\mathcal{B}(\delta_n)$. Conditioning on
the event $\mathcal{A}(\gamma_n)$, we have

$$\|\hat\Delta_S\|_{n,1} = \sum_{j \in S} \|\hat\Delta_j\|_n \le 4 \sum_{j \in S} \|\hat\Delta_j\|_2 + C s \gamma_n \le 4 \sqrt{s} \, \|\hat\Delta_S\|_2 + C s \gamma_n. \tag{33}$$

Our next step is to upper bound $\|\hat\Delta_S\|_2$ in terms of $\|\hat\Delta_S\|_n$ and
$s \gamma_n$. We split our analysis into two cases.

Case 1: If $\|\hat\Delta_S\|_2 < \delta_n = \Theta(\sqrt{s} \, \gamma_n)$, then combined with the
bound (33), we conclude that

$$\|\hat\Delta_S\|_{n,1} \le C s \gamma_n. \tag{34}$$

Case 2: Otherwise, we have $\|\hat\Delta_S\|_2 \ge \delta_n$. Note that the function
$\hat\Delta_S = \sum_{j \in S} \hat\Delta_j$ belongs to the class $2\mathcal{F}$, so that it is
covered by the event $\mathcal{B}(\delta_n)$. In particular, conditioned on the event
$\mathcal{B}(\delta_n)$, we have $\|\hat\Delta_S\|_2 \le 2 \|\hat\Delta_S\|_n$. Combined with the
bound (33), we conclude that

$$\|\hat\Delta_S\|_{n,1} \le C \big\{ \sqrt{s} \, \|\hat\Delta_S\|_n + s \gamma_n \big\}. \tag{35}$$

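Note that (disregarding constants) the bound (34) is at least as good as the bound (35);
therefore, in either case, a bound of the form (35) holds. Substituting the inequality (35) in
the bound (30) yields

$$\|\hat\Delta\|_n^2 \le C \big\{ \gamma_n \|\hat\Delta_S\|_{n,1} + s \gamma_n^2 \big\} \le C \big\{ \sqrt{s} \, \gamma_n \|\hat\Delta_S\|_n + s \gamma_n^2 \big\}. \tag{36}$$

Since $\|\hat\Delta_S\|_n \le \|\hat\Delta\|_n$, the bound (36) implies that
$\|\hat\Delta\|_n \le C \sqrt{s} \, \gamma_n$. This bound is valid conditioned on the events
$\mathcal{T}(\gamma_n)$, $\mathcal{A}(\gamma_n)$ and $\mathcal{B}(\delta_n)$. Using Lemmas 1, 3 and 4
in conjunction, we obtain

$$\mathbb{P}\big( \mathcal{T}(\gamma_n) \cap \mathcal{A}(\gamma_n) \cap \mathcal{B}(\delta_n) \big) \ge 1 - c_1 \exp(-c_2 n \gamma_n^2),$$

thereby showing that $\|\hat f - f^*\|_n \le C \sqrt{s} \, \gamma_n$ with the claimed probability.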
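
The passage from the quadratic bound (36) to the linear bound on $\|\hat\Delta\|_n$ is a short computation, sketched here with the shorthand $x := \|\hat\Delta\|_n$ and $a := \sqrt{s}\,\gamma_n$ (both introduced only for this illustration). Using $\|\hat\Delta_S\|_n \le x$ in (36) gives a quadratic inequality in $x$, whose largest admissible value is the positive root:

```latex
\begin{align*}
x^2 \le C\,(a x + a^2)
\quad &\Longrightarrow \quad
x \le \frac{C a + \sqrt{C^2 a^2 + 4 C a^2}}{2}
  = \frac{C + \sqrt{C^2 + 4C}}{2}\; a, \\
\text{so that} \quad
\|\hat\Delta\|_n &\le C' \sqrt{s}\,\gamma_n
\quad \text{with} \quad C' := \tfrac{1}{2}\big(C + \sqrt{C^2 + 4C}\big).
\end{align*}
```

The constant $C'$ depends only on $C$, which is consistent with the convention that constants may change from line to line.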

Finally, let us extend the result to the error $\|\Pi_{\mathcal{F}}(\hat f) - f^*\|_2$, as mentioned
in the remarks following Theorem 1. In order to do so, we exploit Lemma 4. Since the function
$\Pi_{\mathcal{F}}(\hat f) - f^*$ belongs to $2\mathcal{F}$, we may apply the lemma to it. We conclude
that either $\|\Pi_{\mathcal{F}}(\hat f) - f^*\|_2 \le \delta_n$, in which case we are done, or that,
with probability at least $1 - c_1 \exp(-c_2 n \delta_n^2)$, we have

$$\|\Pi_{\mathcal{F}}(\hat f) - f^*\|_2 \le 2 \|\Pi_{\mathcal{F}}(\hat f) - f^*\|_n
 \le 2 \big\{ \|\Pi_{\mathcal{F}}(\hat f) - \hat f\|_n + \|\hat f - f^*\|_n \big\},$$

where the second step follows by the triangle inequality. Now by definition of the projection, since
$f^* \in \mathcal{F} \subset 2\mathcal{F}$, we must have
$\|\hat f - f^*\|_n \ge \|\hat f - \Pi_{\mathcal{F}}(\hat f)\|_n$, from which we conclude that

$$\|\Pi_{\mathcal{F}}(\hat f) - f^*\|_2 \le 4 \|\hat f - f^*\|_n,$$

which completes the proof of Theorem 1.

4.2  Proof  of  Theorem  2 

We now turn to the proof of the minimax lower bounds stated in Theorem 2. For both parts (a)
and (b), the first step is to follow a standard reduction to testing (e.g., [14, 35, 36]) so as to
obtain a lower bound on the minimax error $\mathfrak{M}_P(\mathcal{F})$ in terms of the probability of
error in a multi-way hypothesis testing problem. We then apply different forms of the Fano inequality
[36, 35] in order to lower bound the probability of error in this testing problem. Obtaining useful
bounds requires a precise characterization of the metric entropy structure of
$\mathcal{F}_{d,s,\mathcal{H}}$, as stated in Lemma 5.

4.2.1 Reduction to testing

We begin with the reduction to a testing problem. Let $\{f^1, \ldots, f^N\}$ be a $\delta_n$-packing
of $\mathcal{F}$ in the $\|\cdot\|_2$-norm, and let $\Theta$ be a random variable uniformly
distributed over the index set $[N] := \{1, 2, \ldots, N\}$. Note that we are using $N$ as a shorthand
for the packing number $M(\delta_n; \mathcal{F}, \|\cdot\|_2)$. A standard argument (e.g.,
[14, 35, 36]) then yields the lower bound

$$\inf_{\hat f} \sup_{f^* \in \mathcal{F}} \mathbb{P}\big[ \|\hat f - f^*\|_2^2 \ge \delta_n^2 / 2 \big] \ge \inf_{\hat\Theta} \mathbb{P}[\hat\Theta \ne \Theta], \tag{37}$$

where the infimum on the right-hand side is taken over all estimators $\hat\Theta$ that are
measurable functions of the data, and take values in the index set $[N]$.

Note that $\mathbb{P}[\hat\Theta \ne \Theta]$ corresponds to the error probability in a multi-way
hypothesis test, where the probability is taken over the random choice of $\Theta$, the randomness of
the design points $X_1^n := \{x_i\}_{i=1}^n$, and the randomness of the observations
$Y_1^n := \{y_i\}_{i=1}^n$. Our initial analysis is performed conditionally on the design points, so
that the only remaining randomness in the observations $Y_1^n$ comes from the observation noise
$\{w_i\}_{i=1}^n$. From Fano's inequality [11], for any estimator $\hat\Theta$, we have
$\mathbb{P}[\hat\Theta \ne \Theta \mid X_1^n] \ge 1 - \frac{I_{X_1^n}(\Theta; Y_1^n) + \log 2}{\log N}$,
where $I_{X_1^n}(\Theta; Y_1^n)$ denotes the mutual information between $\Theta$ and $Y_1^n$ with
$X_1^n$ fixed. Taking expectations over $X_1^n$, we obtain the lower bound

$$\mathbb{P}[\hat\Theta \ne \Theta] \ge 1 - \frac{\mathbb{E}_{X_1^n}\big[ I_{X_1^n}(\Theta; Y_1^n) \big] + \log 2}{\log N}. \tag{38}$$

The remainder of the proof consists of constructing appropriate packing sets of $\mathcal{F}$, and
obtaining good upper bounds on the mutual information term in the lower bound (38).
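
The Fano lower bound is easy to evaluate once a bound on the mutual information and the number of hypotheses are available. The helper below is a minimal sketch (the function name is hypothetical); it uses natural logarithms throughout and clips the bound at zero, since a negative lower bound on a probability carries no information.

```python
import math


def fano_error_lower_bound(mutual_info: float, num_hypotheses: int) -> float:
    """Lower bound 1 - (I + log 2) / log N on the testing error probability.

    `mutual_info` is an upper bound on the (expected) mutual information in
    nats, and `num_hypotheses` is the size N of the packing set.
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    if mutual_info < 0:
        raise ValueError("mutual information is non-negative")
    bound = 1.0 - (mutual_info + math.log(2)) / math.log(num_hypotheses)
    return max(bound, 0.0)


if __name__ == "__main__":
    # One nat of information spread over 2**20 hypotheses.
    print(round(fano_error_lower_bound(1.0, 2**20), 4))
```

The bound is informative only when the packing is large relative to the mutual information, which is why the proof needs both a large packing set and a small Kullback-Leibler diameter.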


4.2.2  Constructing  appropriate  packings 

We begin with results on packing numbers. Recall that $\log M(\delta; \mathcal{F}, \|\cdot\|_2)$
denotes the $\delta$-packing entropy of $\mathcal{F}$ in the $\|\cdot\|_2$ norm.

Lemma 5. (a) For all $\delta \in (0, 1)$ and $s \le d/4$, we have

$$\log M(\delta; \mathcal{F}, \|\cdot\|_2) = \mathcal{O}\Big( s \log M\big(\tfrac{\delta}{\sqrt{s}}; \mathbb{B}_{\mathcal{H}}(1), \|\cdot\|_2\big) + s \log \frac{d}{s} \Big). \tag{39}$$

(b) For a Hilbert class with logarithmic metric entropy (21) and such that
$\|f\|_2 \le \|f\|_{\mathcal{H}}$, there exists a set $\{f^1, \ldots, f^M\}$ with
$\log M \ge C \{ s \log(d/s) + s m \}$, and

$$\delta \le \|f^k - f^m\|_2 \le 8 \delta \quad \text{for all } k \ne m \in \{1, 2, \ldots, M\}. \tag{40}$$

The proof, provided in Appendix E, is combinatorial in nature. We now turn to the proofs of parts
(a) and (b) of Theorem 2.

4.2.3  Proof  of  Theorem  2(a) 

In order to prove this claim, it remains to exploit Lemma 5 in an appropriate way, and to upper
bound the resulting mutual information. For the latter step, we make use of the generalized Fano
approach (e.g., [36]).

From Lemma 5, we can find a set $\{f^1, \ldots, f^M\}$ that is a $\delta$-packing of $\mathcal{F}$ in
the $\|\cdot\|_2$-norm, and such that $\|f^k - f^\ell\|_2 \le 8\delta$ for all $k, \ell \in [M]$. For
$k = 1, \ldots, M$, let $\mathbb{Q}^k$ denote the conditional distribution of $Y_1^n$ conditioned on
$X_1^n$ and the event $\{\Theta = k\}$, and let $D(\mathbb{Q}^k \,\|\, \mathbb{Q}^\ell)$ denote the
Kullback-Leibler divergence. From the convexity of mutual information [11], we have the upper
bound $I_{X_1^n}(\Theta; Y_1^n) \le \frac{1}{\binom{M}{2}} \sum_{k \ne \ell} D(\mathbb{Q}^k \,\|\, \mathbb{Q}^\ell)$.
Given our linear observation model (6), we have

$$D(\mathbb{Q}^k \,\|\, \mathbb{Q}^\ell) = \frac{1}{2} \sum_{i=1}^n \big( f^k(x_i) - f^\ell(x_i) \big)^2 = \frac{n \|f^k - f^\ell\|_n^2}{2},$$

and hence

$$\mathbb{E}_{X_1^n}\big[ I_{X_1^n}(\Theta; Y_1^n) \big]
 \le \frac{n}{2} \frac{1}{\binom{M}{2}} \sum_{k \ne \ell} \mathbb{E}_{X_1^n}\big[ \|f^k - f^\ell\|_n^2 \big]
 = \frac{n}{2} \frac{1}{\binom{M}{2}} \sum_{k \ne \ell} \|f^k - f^\ell\|_2^2.$$

Since our packing satisfies $\|f^k - f^\ell\|_2^2 \le 64 \delta^2$, we conclude that

$$\mathbb{E}_{X_1^n}\big[ I_{X_1^n}(\Theta; Y_1^n) \big] \le 32 n \delta^2.$$

From the Fano bound (38), for any $\delta > 0$ such that
$\frac{32 n \delta^2 + \log 2}{\log M} \le \frac{1}{2}$, we are guaranteed that
$\mathbb{P}[\hat\Theta \ne \Theta] \ge \frac{1}{2}$. From Lemma 5(b), our packing set satisfies
$\log M \ge C \{ s m + s \log(d/s) \}$, so that the choice
$\delta^2 = C' \{ \frac{s m}{n} + \frac{s \log(d/s)}{n} \}$, for a suitably small $C' > 0$, can be used
to guarantee the error bound $\mathbb{P}[\hat\Theta \ne \Theta] \ge \frac{1}{2}$.


4.2.4  Proof  of  Theorem  2(b) 

In this case, we use an upper bounding technique due to Yang and Barron [35] in order to upper
bound the mutual information. Although the argument is essentially the same, it does not follow
verbatim from their claims (in particular, there are some slight differences due to our initial
conditioning), so we provide the details here. By definition of the mutual information, we have

$$I_{X_1^n}(\Theta; Y_1^n) = \frac{1}{M} \sum_{k=1}^M D(\mathbb{Q}^k \,\|\, \mathbb{P}_Y),$$

where $\mathbb{Q}^k$ denotes the conditional distribution of $Y_1^n$ given $\Theta = k$ and still
with $X_1^n$ fixed, whereas $\mathbb{P}_Y$ denotes the marginal distribution of $Y_1^n$. Now let
$\{g^1, \ldots, g^N\}$ be an $\epsilon$-cover of $\mathcal{F}$ in the $\|\cdot\|_2$ norm, for a
tolerance $\epsilon$ to be chosen. As argued in Yang and Barron [35], we have

$$I_{X_1^n}(\Theta; Y_1^n) = \frac{1}{M} \sum_{k=1}^M D(\mathbb{Q}^k \,\|\, \mathbb{P}_Y)
 \le \frac{1}{M} \sum_{k=1}^M D\Big( \mathbb{Q}^k \,\Big\|\, \frac{1}{N} \sum_{\ell=1}^N \mathbb{P}^\ell \Big),$$

where $\mathbb{P}^\ell$ denotes the conditional distribution of $Y_1^n$ given $g^\ell$ and $X_1^n$.
For each $k$, let us choose $\ell^*(k) \in \arg\min_{\ell = 1, \ldots, N} \|g^\ell - f^k\|_2$. We then
have the upper bound

$$I_{X_1^n}(\Theta; Y_1^n) \le \frac{1}{M} \sum_{k=1}^M \Big\{ \log N + \frac{n}{2} \|g^{\ell^*(k)} - f^k\|_n^2 \Big\}.$$

Taking expectations over $X_1^n$, we obtain

$$\mathbb{E}_{X_1^n}\big[ I_{X_1^n}(\Theta; Y_1^n) \big]
 \le \frac{1}{M} \sum_{k=1}^M \Big\{ \log N + \frac{n}{2} \mathbb{E}_{X_1^n}\big[ \|g^{\ell^*(k)} - f^k\|_n^2 \big] \Big\}
 \le \log N + \frac{n}{2} \epsilon^2,$$

where the final inequality follows from the choice of our covering set.

From this point, we can follow the same steps as Yang and Barron [35]. The polynomial
scaling (22) of the metric entropy guarantees that their conditions are satisfied, and we conclude
that the minimax error is lower bounded by any $\delta_n > 0$ such that

$$n \delta_n^2 \asymp \log N(\delta_n; \mathcal{F}, \|\cdot\|_2).$$

From Lemma 5 and the assumed scaling (22), it is equivalent to solve the equation

$$n \delta_n^2 \asymp s \log(d/s) + s \big( \sqrt{s} / \delta_n \big)^{1/\alpha},$$

from which some algebra yields
$\delta_n^2 = C \big\{ \frac{s \log(d/s)}{n} + s \big(\frac{1}{n}\big)^{\frac{2\alpha}{2\alpha+1}} \big\}$
as a suitable choice.
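
The "some algebra" step can be checked numerically. The sketch below treats the relation as an exact equation with all constants set to one (an assumption made only for illustration), solves it for $\delta_n$ by bisection on a logarithmic scale, and compares the solution with the two terms of the claimed rate. Balancing $n\delta^2$ against $s(\sqrt{s}/\delta)^{1/\alpha}$ alone gives $\delta^2 = s\, n^{-2\alpha/(2\alpha+1)}$ exactly, and the full solution lies between the maximum and the sum of the two terms. The function names are hypothetical.

```python
import math


def estimation_term(n: int, s: int, alpha: float) -> float:
    """Closed form s * n^(-2a / (2a + 1)) solving n*delta^2 = s*(sqrt(s)/delta)^(1/a)."""
    return s * n ** (-2.0 * alpha / (2.0 * alpha + 1.0))


def solve_delta_sq(n: int, d: int, s: int, alpha: float,
                   lo: float = 1e-12, hi: float = 1e6, iters: int = 200) -> float:
    """Solve n*delta^2 = s*log(d/s) + s*(sqrt(s)/delta)^(1/alpha) for delta^2."""
    def gap(delta: float) -> float:
        # Increasing in delta: the first term grows, the last term shrinks.
        return (n * delta ** 2 - s * math.log(d / s)
                - s * (math.sqrt(s) / delta) ** (1.0 / alpha))

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint (bisection in log scale)
        if gap(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi ** 2


if __name__ == "__main__":
    n, d, s, alpha = 1000, 10_000, 10, 2.0
    sel = s * math.log(d / s) / n
    est = estimation_term(n, s, alpha)
    print(sel, est, solve_delta_sq(n, d, s, alpha))
```

Since the solution is sandwiched between $\max$ and sum of the two terms, it matches the stated rate up to a factor of two in this constant-free version.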


5  Discussion 

In this paper, we have studied estimation in the class of sparse additive models defined by
univariate reproducing kernel Hilbert spaces. In conjunction, our two main results provide a precise
characterization of the minimax-optimal rates for estimating $f^*$ in the $L^2(\mathbb{P})$-norm for
various kernel classes. These classes include the case of finite-rank kernels (with logarithmic
metric entropy), as well as kernels with polynomially decaying eigenvalues (and hence polynomial
metric entropy). In order to establish achievable rates, we analyzed a simple M-estimator based on
regularizing the least-squares loss with two kinds of $\ell_1$-based norms, one defined by the
univariate Hilbert norm and the other by the univariate empirical norm. On the other hand, we
obtained our lower bounds by a combination of approximation-theoretic and information-theoretic
techniques. An interesting feature of the minimax rates derived here is that they exhibit a natural
decoupling into the complexities associated with two sub-problems. The first term corresponds to the
difficulty of performing subset selection, that is, determining which $s$ out of $d$ co-ordinate
functions are active. The second term corresponds to the difficulty of estimating a sum of $s$
univariate functions, assuming that the correct co-ordinates are known.

There  are  a  number  of  ways  in  which  this  work  could  be  extended.  For  instance,  although 
our  analysis  was  based  on  assuming  independence  of  the  covariates  xj,  j  =  1,  2, ...  d,  it  would  be 
interesting  to  investigate  the  case  when  the  random  variables  are  endowed  with  some  correlation 
structure.  One  might  expect  some  changes  in  the  optimal  rates,  particularly  if  many  of  the  variables 
are  strongly  dependent.  This  work  considered  only  the  function  class  consisting  of  sums  of  univariate 
functions;  a  natural  extension  would  be  to  consider  nested  non-parametric  classes  formed  of  sums 
over  hierarchies  of  subsets  of  variables.  Analysis  in  this  case  would  require  dealing  with  dependencies 
between  the  different  functions. 




Acknowledgements 

This work was partially supported by NSF grants DMS-0605165 and DMS-0907632 to MJW and
BY. In addition, BY was partially supported by the NSF grant SES-0835531 (CDI) as well as a
grant from the MSRA. MJW was also partially supported by AFOSR Grant FA9550-09-1-0466.
During this work, GR was financially supported by a Berkeley Graduate Fellowship.


A  Proof  of  Lemma  1 


Define the function

$$\mathcal{R}_{n,j}(r) := \mathbb{E}_w\Big[ \sup_{\substack{\|g_j\|_{\mathcal{H}} \le 1 \\ \|g_j\|_n \le r}} \frac{1}{n} \sum_{i=1}^n w_i g_j(x_{ij}) \Big],$$

and let $\hat\nu_{n,j} > 0$ denote the smallest positive solution of the inequality
$256 \, r^2 \ge \mathcal{R}_{n,j}(r)$. The function $\mathcal{R}_{n,j}(r)$ defines the local Gaussian
complexity of the kernel class in co-ordinate $j$. Using the techniques of Mendelson [24], it follows
that there is a universal constant $c_0 > 0$ such that

$$\mathcal{R}_{n,j}(r) \le c_0 \Big[ \frac{1}{n} \sum_{\ell=1}^n \min\{\hat\mu_\ell, r^2\} \Big]^{1/2}, \tag{41}$$

where $\{\hat\mu_\ell\}_{\ell=1}^n$ are the eigenvalues of the empirical kernel matrix. (The results
of Mendelson are stated for the population Rademacher complexity, but a similar argument establishes
the bound (41) for the empirical Gaussian complexity.)
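
To illustrate how a critical radius is obtained from kernel eigenvalues, the sketch below uses the eigenvalue expression on the right-hand side of (41) as a stand-in for the local Gaussian complexity and finds the smallest $r$ with $\text{const} \cdot r^2 \ge [\frac{1}{n}\sum_\ell \min\{\hat\mu_\ell, r^2\}]^{1/2}$ by bisection. This is an assumption-laden proxy, not the quantity $\hat\nu_{n,j}$ itself: the bound (41) is only an upper bound, and the argument `const` (hypothetical) absorbs both the factor 256 and the unspecified constant $c_0$. The function names are likewise hypothetical.

```python
import math
from typing import Sequence


def kernel_complexity(eigenvalues: Sequence[float], r: float) -> float:
    """Eigenvalue proxy sqrt((1/n) * sum_l min(mu_l, r^2)), with n = len(eigenvalues)."""
    n = len(eigenvalues)
    return math.sqrt(sum(min(mu, r * r) for mu in eigenvalues) / n)


def critical_radius(eigenvalues: Sequence[float], const: float = 256.0,
                    iters: int = 200) -> float:
    """Smallest r > 0 (up to bisection accuracy) with const * r^2 >= kernel_complexity(r)."""
    if max(eigenvalues) <= 0:
        raise ValueError("need at least one positive eigenvalue")
    lo, hi = 1e-12, 1.0
    # Grow hi until the inequality holds; it holds for all larger r because
    # kernel_complexity(r) / r is non-increasing while const * r increases.
    while const * hi * hi < kernel_complexity(eigenvalues, hi):
        hi *= 2.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if const * mid * mid >= kernel_complexity(eigenvalues, mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    rank_one = [1.0] + [0.0] * 99  # n = 100, a single non-zero eigenvalue
    print(critical_radius(rank_one, const=1.0))
```

For the rank-one spectrum above with `const = 1`, the proxy equals $\min(1, r)/10$, so the critical radius is $0.1$.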

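Recall that the critical univariate rate $\nu_n$ is defined in terms of the closely related function
$\mathcal{R}_n(r) := \big[ \frac{1}{n} \sum_{\ell=1}^\infty \min\{\mu_\ell, r^2\} \big]^{1/2}$, where
$\{\mu_\ell\}_{\ell=1}^\infty$ are the eigenvalues of the (population) kernel operator. Define the
event

$$\mathcal{D}(\gamma_n) := \big\{ \hat\nu_{n,j} \le \gamma_n \ \text{ for all } j = 1, 2, \ldots, d \big\}, \tag{42}$$

where we recall that $\gamma_n := \kappa_1 \max\{\nu_n, \sqrt{\frac{\log d}{n}}\}$. It is a
consequence of Lemma 6 in Appendix F that
$\mathbb{P}[\mathcal{D}(\gamma_n)] \ge 1 - c_1 \exp(-c_2 n \gamma_n^2)$. Consequently, we proceed by
conditioning on this event throughout the remainder of the proof.

In the remainder of the proof, our goal is to prove that

$$\Big| \frac{1}{n} \sum_{i=1}^n w_i f_j(x_{ij}) \Big| \le C \big\{ \gamma_n^2 \|f_j\|_{\mathcal{H}} + \gamma_n \|f_j\|_n \big\} \quad \text{for all } f_j \in \mathcal{H} \tag{43}$$

with probability greater than $1 - c_1 \exp(-c_2 n \gamma_n^2)$. By combining this result with our
choice of $\gamma_n$ and the union bound, the claimed bound on $\mathbb{P}[\mathcal{T}(\gamma_n)]$
then follows.

If $f_j = 0$, then the claim (43) is trivial. Otherwise, we write

$$\frac{1}{n} \sum_{i=1}^n w_i f_j(x_{ij}) = \|f_j\|_{\mathcal{H}} \, \frac{1}{n} \sum_{i=1}^n w_i g_j(x_{ij}), \quad \text{where } g_j := f_j / \|f_j\|_{\mathcal{H}}.$$

Noting that $\|g_j\|_{\mathcal{H}} = 1$, we are led to study the random variable

$$Z_{n,j}(w; r_j) := \sup_{\substack{\|g_j\|_{\mathcal{H}} \le 1 \\ \|g_j\|_n \le r_j}} \frac{1}{n} \sum_{i=1}^n w_i g_j(x_{ij}),$$

a quantity that satisfies $\mathcal{R}_{n,j}(r_j) = \mathbb{E}_w[Z_{n,j}(w; r_j)]$ by construction.

Our next step is to establish that for any fixed radius $r_j > 0$, we have the tail bound

$$\mathbb{P}\big[ Z_{n,j}(w; r_j) \ge C \{ \gamma_n^2 + \gamma_n r_j \} \big] \le c_1 \exp\big\{ -c_2 n \gamma_n^2 \big( 1 + (\gamma_n / r_j)^2 \big) \big\}. \tag{44}$$

We then use a peeling argument to extend the bound to a uniform one over the radius $r_j$.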


Establishing the tail bound (44): Viewing $Z_{n,j}$ as a function of $w$, we first bound its
Lipschitz constant. For any two vectors $w, w' \in \mathbb{R}^n$, we have

$$|Z_{n,j}(w; r_j) - Z_{n,j}(w'; r_j)| \le \frac{1}{n} \sup_{\|g_j\|_n \le r_j} \Big| \sum_{i=1}^n (w_i - w_i') g_j(x_{ij}) \Big|
 \le \frac{r_j}{\sqrt{n}} \|w - w'\|_2.$$

Therefore, by concentration of measure for Lipschitz functions of Gaussian variables [19], we have

$$\mathbb{P}\big[ Z_{n,j}(w; r_j) \ge \mathbb{E}[Z_{n,j}(w; r_j)] + t \big] \le 2 \exp\Big( -\frac{n t^2}{2 r_j^2} \Big). \tag{45}$$

Setting $t = \gamma_n (r_j + \gamma_n)$ yields an upper bound of the form of the right-hand side of
equation (44).

In  order  to  complete  the  proof  of  the  bound  (44),  we  need  to  show 

E [Zn,j(w;  rj)]  <  C{^2  +  77}  for  all  77  >  0. 

We  do  so  by  splitting  into  two  cases. 

Case  1:  If  rj  <  vnj,  then  we  have  'R-n.j  (r3 )  <  'R.nj(yn,j)  <  256  u2  ■  where  the  second  inequality 
follows  from  our  choice  of  Vnj . 


Case  2:  Otherwise,  if  rj  >  z/nj,  we  have 

r  1  71 

T^n,j{r)  =  Era[  sup  -Y ^Wig(xij)\ 

un,j  n  7^1 

hA  K<V 

T  ^  ^ 

—  jj  ^nj^nj) 
un,j 

<  256 rvnj, 

where  the  final  line  uses  the  fact  that  TZnj(unj)  "A  256  u2  j.  Combining  yields  the  bound 

Kn,j(r)  <  C{Dld  +  r  vnj } . 

Under  the  event  X>(7n)  previously  defined  (42),  we  have  vnj  <  7 n,  so  that  the  proof  of  the  claim  (44) 
is  complete. 
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Peeling  argument:  We  now  use  the  bound  (44)  to  prove  the  bound  (43),  in  particular  via 
a  “peeling”  operation  over  all  choices  of  rj  =  H/jUn/ll/jll'H-  We  first  claim  that  it  suffices  to 
consider  r  E  (0,1].  In  order  to  show  that  r  <  1,  it  is  equivalent  to  show  that  |  c/j  1 1 n  <  1  for  any 
gj  E  B%(1).  Recall  that  we  have  assumed  that  \\gj ||oo  <  1  for  all  gj  E  B%(1).  Consequently, 
whenever  gj  E  B^(l),  we  have  \\gj\\2n  =  \  Yli=i9j(xij)  <  1,  as  required. 

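Now define the event

$$\mathcal{T}_j(\gamma_n) := \Big\{ \exists\, f_j \in \mathbb{B}_{\mathcal{H}}(1) \;\Big|\; \Big| \frac{1}{n} \sum_{i=1}^n w_i f_j(x_{ij}) \Big| > 2C\, \|f_j\|_{\mathcal{H}} \Big\{ \gamma_n^2 + \gamma_n \frac{\|f_j\|_n}{\|f_j\|_{\mathcal{H}}} \Big\} \Big\}. \tag{46}$$

Note that we have the decomposition $\mathcal{T}_j(\gamma_n) = \mathcal{T}_j^A(\gamma_n) \cup \mathcal{T}_j^B(\gamma_n)$, where

$$\mathcal{T}_j^A(\gamma_n) := \mathcal{T}_j(\gamma_n) \cap \Big\{ \frac{\|f_j\|_n}{\|f_j\|_{\mathcal{H}}} \le \gamma_n \Big\}, \quad \text{and} \quad \mathcal{T}_j^B(\gamma_n) := \mathcal{T}_j(\gamma_n) \cap \Big\{ \frac{\|f_j\|_n}{\|f_j\|_{\mathcal{H}}} \in (\gamma_n, 1] \Big\}.$$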

It  remains  to  obtain  upper  bounds  on  the  probabilities  of  these  two  events. 


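Case A: For $m = 1, 2, 3, \ldots$, define the sets

$$S_m := \Big\{ \frac{\gamma_n}{2^m} \le \frac{\|f_j\|_n}{\|f_j\|_{\mathcal{H}}} \le \frac{\gamma_n}{2^{m-1}} \Big\}.$$

If the event $\mathcal{T}_j^A(\gamma_n)$ occurs, then it must occur for a function $f_j$ belonging to some $S_m$, so that we
have a function $f_j$ such that $\|f_j\|_n / \|f_j\|_{\mathcal{H}} \le \gamma_n / 2^{m-1} =: r_m$, and

$$\Big| \frac{1}{n} \sum_{i=1}^n w_i f_j(x_{ij}) \Big| \ge 2C\, \|f_j\|_{\mathcal{H}} \Big\{ \gamma_n^2 + \gamma_n \frac{\|f_j\|_n}{\|f_j\|_{\mathcal{H}}} \Big\} \ge 2C\, \|f_j\|_{\mathcal{H}} \Big\{ \gamma_n^2 + \frac{\gamma_n r_m}{2} \Big\} \ge C\, \|f_j\|_{\mathcal{H}} \{ \gamma_n^2 + \gamma_n r_m \},$$

which implies that $Z_{n,j}(w; r_m) \ge C\{\gamma_n^2 + \gamma_n r_m\}$. Consequently, by union bound and the tail
bound (44), we have

$$\mathbb{P}[\mathcal{T}_j^A(\gamma_n)] \le c_1 \sum_{m=1}^{\infty} \exp\big\{ -c_2 n \gamma_n^2 \big(1 + (\gamma_n / r_m)^2\big) \big\} = c_1 \sum_{m=1}^{\infty} \exp\big\{ -c_2 n \gamma_n^2 \big(1 + 2^{2m-2}\big) \big\} \le c_1 \exp(-c_2 n \gamma_n^2).$$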


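Case B: In this case, we define the sets

$$S_m := \Big\{ 2^{m-1} \gamma_n \le \frac{\|f_j\|_n}{\|f_j\|_{\mathcal{H}}} \le 2^m \gamma_n \Big\} \quad \text{for } m = 1, 2, \ldots, M,$$

where $M = 2 \log_2(1/\gamma_n)$ so that $2^M \gamma_n \ge 1$. By the same argument, we then have

$$\mathbb{P}[\mathcal{T}_j^B(\gamma_n)] \le M\, c_1 \exp(-c_2 n \gamma_n^2) \le c_1 \exp\big( -c_2 n \gamma_n^2 + 2 \log(1/\gamma_n) \big) \le c_1' \exp(-c_2' n \gamma_n^2),$$

by the condition $n \gamma_n^2 = \Omega(\log(1/\gamma_n))$.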
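The two peeling cases reduce to elementary facts about dyadic shells: in Case A the sum over shells is dominated by a single exponential, and in Case B a logarithmic number of shells covers the interval $(\gamma_n, 1]$. The sketch below checks both numerically. It is only an illustration: the argument `a` stands in for the product $c_2 n \gamma_n^2$, the shell count is rounded up to an integer (the text uses $2\log_2(1/\gamma_n)$ directly), and the function names are hypothetical.

```python
import math


def case_a_sum(a, terms=40):
    """Sum over m = 1..terms of exp(-a * (1 + 4**(m - 1))).

    Here `a` plays the role of c2 * n * gamma_n**2, and 4**(m - 1) = 2**(2m - 2)
    is the squared ratio (gamma_n / r_m)**2 for the m-th shell of Case A.
    """
    return sum(math.exp(-a * (1 + 4 ** (m - 1))) for m in range(1, terms + 1))


def case_b_shells(gamma):
    """Integer number of dyadic shells M >= 2 * log2(1 / gamma) used in Case B."""
    return math.ceil(2 * math.log2(1.0 / gamma))


# Case A: for a >= 1 the whole series is at most exp(-a).
case_a_ok = all(case_a_sum(a) <= math.exp(-a) for a in (1.0, 5.0, 20.0))

# Case B: the largest shell reaches ratio 1, i.e. 2**M * gamma >= 1.
case_b_ok = all(2 ** case_b_shells(g) * g >= 1 for g in (0.5, 0.1, 0.01))
```

The Case B check makes visible why only the factor $M$, and hence an additive $\log$ term in the exponent, is paid for the union bound over shells.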

B  Proof  of  Lemma  2 

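Define the function

$$\widetilde{\mathcal{L}}(\Delta) := \frac{1}{2n} \sum_{i=1}^n \big( w_i - \Delta(x_i) \big)^2 + \lambda_n \|f^* + \Delta\|_{n,1} + \rho_n \|f^* + \Delta\|_{\mathcal{H},1}$$

and note that by definition of our M-estimator, the error function $\widehat{\Delta} := \hat{f} - f^*$ minimizes $\widetilde{\mathcal{L}}$. From
the inequality $\widetilde{\mathcal{L}}(\widehat{\Delta}) \le \widetilde{\mathcal{L}}(0)$, we obtain

$$\frac{1}{2} \|\widehat{\Delta}\|_n^2 \le \Big| \frac{1}{n} \sum_{i=1}^n w_i \widehat{\Delta}(x_i) \Big| + \lambda_n \sum_{j=1}^d \big\{ \|f_j^*\|_n - \|f_j^* + \widehat{\Delta}_j\|_n \big\} + \rho_n \sum_{j=1}^d \big\{ \|f_j^*\|_{\mathcal{H}} - \|f_j^* + \widehat{\Delta}_j\|_{\mathcal{H}} \big\}.$$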

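Now for any $j \in S^c$, we have $f_j^* = 0$, and hence

$$\|f_j^*\|_n - \|f_j^* + \widehat{\Delta}_j\|_n = -\|\widehat{\Delta}_j\|_n, \quad \text{and} \quad \|f_j^*\|_{\mathcal{H}} - \|f_j^* + \widehat{\Delta}_j\|_{\mathcal{H}} = -\|\widehat{\Delta}_j\|_{\mathcal{H}}.$$

On the other hand, for any $j \in S$, the triangle inequality yields

$$\|f_j^*\|_n - \|f_j^* + \widehat{\Delta}_j\|_n \le \|\widehat{\Delta}_j\|_n,$$

with a similar inequality for the terms involving $\|\cdot\|_{\mathcal{H}}$. Since $\frac{1}{2} \|\widehat{\Delta}\|_n^2 \ge 0$, we conclude that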
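$$0 \le \Big| \frac{1}{n} \sum_{i=1}^n w_i \widehat{\Delta}(x_i) \Big| + \lambda_n \big\{ \|\widehat{\Delta}_S\|_{n,1} - \|\widehat{\Delta}_{S^c}\|_{n,1} \big\} + \rho_n \big\{ \|\widehat{\Delta}_S\|_{\mathcal{H},1} - \|\widehat{\Delta}_{S^c}\|_{\mathcal{H},1} \big\}. \tag{47}$$

Recalling our conditioning on the event $\mathcal{T}(\gamma_n)$, we have the upper bound

$$\Big| \frac{1}{n} \sum_{i=1}^n w_i \widehat{\Delta}(x_i) \Big| \le 2 \kappa_3 \big\{ \gamma_n \|\widehat{\Delta}\|_{n,1} + \gamma_n^2 \|\widehat{\Delta}\|_{\mathcal{H},1} \big\}.$$

Combining with the inequality (47) yields

$$0 \le 2 \kappa_3 \big\{ \gamma_n \|\widehat{\Delta}\|_{n,1} + \gamma_n^2 \|\widehat{\Delta}\|_{\mathcal{H},1} \big\} + \lambda_n \big\{ \|\widehat{\Delta}_S\|_{n,1} - \|\widehat{\Delta}_{S^c}\|_{n,1} \big\} + \rho_n \big\{ \|\widehat{\Delta}_S\|_{\mathcal{H},1} - \|\widehat{\Delta}_{S^c}\|_{\mathcal{H},1} \big\}$$

$$\le \frac{\lambda_n}{2} \|\widehat{\Delta}\|_{n,1} + \frac{\rho_n}{2} \|\widehat{\Delta}\|_{\mathcal{H},1} + \lambda_n \big\{ \|\widehat{\Delta}_S\|_{n,1} - \|\widehat{\Delta}_{S^c}\|_{n,1} \big\} + \rho_n \big\{ \|\widehat{\Delta}_S\|_{\mathcal{H},1} - \|\widehat{\Delta}_{S^c}\|_{\mathcal{H},1} \big\},$$

where we have recalled our choices of $(\lambda_n, \rho_n)$. Finally, re-arranging terms yields the claim (29).

C Proof of Lemma 3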

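Lemma 3 is a straightforward consequence of Lemma 7 from Appendix F. In particular, applying
the latter lemma with $t = \gamma_n/2 \ge \epsilon_n$ yields

$$\|f_j\|_n \le \|f_j\|_2 + \frac{\gamma_n}{4} \quad \text{for all } f_j \in \mathbb{B}_{\mathcal{H}}(1) \text{ with } \|f_j\|_2 \le \gamma_n/2$$

with probability greater than $1 - c_1 \exp(-c_2 n \gamma_n^2)$. On the other hand, if $\|f_j\|_2 > \gamma_n/2$, then the sandwich
relation (66) implies that $\|f_j\|_n \le 2 \|f_j\|_2$ with probability greater than $1 - c_1 \exp(-c_2 n \gamma_n^2)$.

Defining the rescaled functions $g_j = 2 f_j \in \mathbb{B}_{\mathcal{H}}(2)$, we have established that

$$\mathbb{P}[\mathcal{A}_j^c(\gamma_n)] \le c_1 \exp(-c_2 n \gamma_n^2).$$

Recalling that $\mathcal{A}(\gamma_n) = \cap_{j=1}^d \mathcal{A}_j(\gamma_n)$, we can combine this upper bound with union bound, thereby
obtaining

$$\mathbb{P}[\mathcal{A}^c(\gamma_n)] \le d\, c_1 \exp(-c_2 n \gamma_n^2) \le c_1 \exp(-c_2' n \gamma_n^2),$$

where we have used the fact that $\gamma_n = \Omega\big(\sqrt{\log d / n}\big)$.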


D  Proof  of  Lemma  4 


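Define the alternative event

$$\mathcal{B}'(\delta_n) := \big\{ \|h\|_n^2 \ge \delta_n^2 / 4 \ \text{ for all } h \in 2\mathcal{F} \text{ with } \|h\|_2 = \delta_n \big\}.$$

We claim that it suffices to show that $\mathcal{B}'(\delta_n)$ holds with high probability. Indeed, given an arbitrary
$g \in 2\mathcal{F} = \{f + f' \mid f, f' \in \mathcal{F}\}$ with $\|g\|_2 \ge \delta_n$, we can define $h = \frac{\delta_n}{\|g\|_2}\, g$. Since $g \in 2\mathcal{F}$ and $2\mathcal{F}$ is
star-shaped, we have $h \in 2\mathcal{F}$, and also $\|h\|_2 = \delta_n$ by construction. Therefore, if $\mathcal{B}'(\delta_n)$ holds, we
have $\|h\|_n^2 \ge \delta_n^2/4$, which implies that

$$\frac{\delta_n^2}{\|g\|_2^2}\, \|g\|_n^2 \ge \frac{\delta_n^2}{4},$$

or equivalently that $\|g\|_n^2 \ge \|g\|_2^2 / 4$, showing that $\mathcal{B}(\delta_n)$ holds.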

Accordingly, the remainder of the proof is devoted to showing that $\mathcal{B}'(\delta_n)$ holds with high
probability. For a truncation level $\tau > 0$ to be chosen, define the function

$$\phi_\tau(u) = \begin{cases} u^2 & \text{if } |u| \le \tau \\ \tau^2 & \text{otherwise.} \end{cases} \tag{48}$$

By construction, $\phi_\tau$ is continuous, Lipschitz with constant $2\tau$, and bounded by $\tau^2$. Since $u^2 \ge \phi_\tau(u)$
for all $u \in \mathbb{R}$, we have

$$\frac{1}{n} \sum_{i=1}^n g^2(x_i) \ge \frac{1}{n} \sum_{i=1}^n \phi_\tau(g(x_i)). \tag{49}$$
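The three properties of the truncated square claimed after (48) (boundedness by $\tau^2$, domination by $u^2$, and the Lipschitz constant $2\tau$) can be checked directly on a grid. The snippet below is a small sanity check under illustrative assumptions: the value $\tau = 2$ and the grid on $[-5, 5]$ are arbitrary choices, and `phi` is a hypothetical helper name.

```python
def phi(u, tau):
    """Truncated square from (48): u**2 if |u| <= tau, and tau**2 otherwise."""
    return u * u if abs(u) <= tau else tau * tau


tau = 2.0
grid = [k / 10.0 for k in range(-50, 51)]  # u ranges over [-5, 5] in steps of 0.1

# phi_tau is bounded by tau^2.
bounded = all(phi(u, tau) <= tau * tau for u in grid)

# phi_tau never exceeds u^2, which is what inequality (49) uses.
below_square = all(phi(u, tau) <= u * u for u in grid)

# phi_tau is Lipschitz with constant 2 * tau (small slack for rounding).
lipschitz = all(
    abs(phi(u, tau) - phi(v, tau)) <= 2 * tau * abs(u - v) + 1e-12
    for u in grid
    for v in grid
)
```

The Lipschitz property is what later allows the contraction inequality to be applied to the truncated process.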

The remainder of the proof consists of the following steps:

(1) First, we show that for all $g \in 2\mathcal{F}$ with $\|g\|_2 = \delta_n$, we have

$$\mathbb{E}[\phi_\tau(g(x))] \ge \frac{\delta_n^2}{2}. \tag{50}$$

(2) Next we prove that

$$\sup_{\substack{g \in 2\mathcal{F} \\ \|g\|_2 \le \delta_n}} \Big| \frac{1}{n} \sum_{i=1}^n \phi_\tau(g(x_i)) - \mathbb{E}[\phi_\tau(g(x))] \Big| \le \frac{\delta_n^2}{4} \tag{51}$$

with high probability.

Putting together the pieces, we conclude that for any $g \in 2\mathcal{F}$ with $\|g\|_2 = \delta_n$, we have

$$\frac{1}{n} \sum_{i=1}^n \phi_\tau(g(x_i)) \ge \frac{\delta_n^2}{2} - \frac{\delta_n^2}{4} = \frac{\delta_n^2}{4}$$

with high probability (to be specified later). Combined with the lower bound (49), this shows that
event $\mathcal{B}'(\delta_n)$ holds with high probability, thereby completing the proof. It remains to establish the
claims (50) and (51).


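Establishing the lower bound (50): By the definition of $\phi_\tau$, we have

$$\mathbb{E}[\phi_\tau(g(x))] \ge \mathbb{E}\big[g^2(x)\, \mathbb{I}[|g(x)| \le \tau]\big] = \mathbb{E}[g^2(x)] - \mathbb{E}\big[g^2(x)\, \mathbb{I}[|g(x)| > \tau]\big] = \delta_n^2 - \mathbb{E}\big[g^2(x)\, \mathbb{I}[|g(x)| > \tau]\big].$$

Consequently, it suffices to show that, with appropriate choice of the truncation level $\tau$, we have
$\mathbb{E}[g^2(x)\, \mathbb{I}[|g(x)| > \tau]] \le \delta_n^2/2$. By the Cauchy-Schwarz inequality, we have

$$\big( \mathbb{E}[g^2(x)\, \mathbb{I}[|g(x)| > \tau]] \big)^2 \le \mathbb{E}[g^4(x)]\, \mathbb{E}\big[\mathbb{I}^2[|g(x)| > \tau]\big] = \mathbb{E}[g^4(x)]\, \mathbb{P}\big[|g(x)|^2 > \tau^2\big] \le \mathbb{E}[g^4(x)]\, \frac{\delta_n^2}{\tau^2}, \tag{52}$$

where the final step uses Markov's inequality, and the fact $\mathbb{E}[g^2(x)] = \delta_n^2$. It remains to bound the
fourth moment. Any $g \in 2\mathcal{F}$ can be written as a sum $g = \sum_{j \in U} g_j$ of univariate functions over a
subset $U$ of cardinality at most $2s$, so that

$$\mathbb{E}[g^4(x)] = \mathbb{E}\Big[ \Big( \sum_{j \in U} g_j(x_j) \Big)^4 \Big] \le \sum_{j \in U} \mathbb{E}[g_j^4(x_j)] + \binom{4}{2} \sum_{j \in U} \sum_{k \in U \setminus \{j\}} \mathbb{E}[g_j^2(x_j)]\, \mathbb{E}[g_k^2(x_k)],$$

where we have used the binomial expansion, the independence of $x_j$ from co-ordinate to co-ordinate,
and the fact that $\mathbb{E}[g_j(x_j)] = 0$. Re-arranging the second sum yields

$$\mathbb{E}[g^4(x)] \le \sum_{j \in U} \mathbb{E}[g_j^4(x_j)] + \binom{4}{2} \sum_{j \in U} \mathbb{E}[g_j^2(x_j)] \sum_{k \in U \setminus \{j\}} \mathbb{E}[g_k^2(x_k)]$$

$$\le \sum_{j \in U} \mathbb{E}[g_j^4(x_j)] + \binom{4}{2} \Big\{ \sum_{j \in U} \mathbb{E}[g_j^2(x_j)] \Big\}\, \mathbb{E}[g^2(x)] = \sum_{j \in U} \mathbb{E}[g_j^4(x_j)] + \binom{4}{2} \delta_n^4.$$

For a function $g = \sum_{j \in U} g_j \in 2\mathcal{F}$, each univariate function satisfies $\|g_j\|_\infty \le 2$, so that we have
$\mathbb{E}[g_j^4(x_j)] \le 4\, \mathbb{E}[g_j^2(x_j)]$, and hence $\sum_{j \in U} \mathbb{E}[g_j^4(x_j)] \le 4 \sum_{j \in U} \mathbb{E}[g_j^2(x_j)] = 4 \delta_n^2$. Overall, we have
shown that

$$\mathbb{E}[g^4(x)] \le 4 \delta_n^2 + 6 \delta_n^4 \le 10\, \delta_n^2,$$

where the last step uses $\delta_n \le 1$. Substituting back into the inequality (52), we find that

$$\mathbb{E}\big[g^2(x)\, \mathbb{I}[|g(x)| > \tau]\big] \le \sqrt{10\, \delta_n^2 \cdot \frac{\delta_n^2}{\tau^2}} = \frac{\sqrt{10}\, \delta_n^2}{\tau},$$

so that setting $\tau = 2\sqrt{10}$ is sufficient to prove the claim (50).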
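To make the role of the coefficient $\binom{4}{2} = 6$ concrete, the following worked example specializes the fourth-moment computation to two components. It is an illustration only, assuming a hypothetical index set $U = \{1, 2\}$ with independent, zero-mean $g_1(x_1)$ and $g_2(x_2)$; it is not taken from the original text.

```latex
\begin{align*}
\mathbb{E}\big[(g_1 + g_2)^4\big]
  &= \mathbb{E}[g_1^4] + 4\,\mathbb{E}[g_1^3]\,\mathbb{E}[g_2]
     + 6\,\mathbb{E}[g_1^2]\,\mathbb{E}[g_2^2]
     + 4\,\mathbb{E}[g_1]\,\mathbb{E}[g_2^3] + \mathbb{E}[g_2^4] \\
  &= \mathbb{E}[g_1^4] + \mathbb{E}[g_2^4] + 6\,\mathbb{E}[g_1^2]\,\mathbb{E}[g_2^2]
     && \text{(zero means kill the odd terms)} \\
  &\le 4\big(\mathbb{E}[g_1^2] + \mathbb{E}[g_2^2]\big)
     + 6\big(\mathbb{E}[g_1^2] + \mathbb{E}[g_2^2]\big)^2
     && \text{(using } \|g_j\|_\infty \le 2\text{)} \\
  &= 4\,\delta_n^2 + 6\,\delta_n^4 .
\end{align*}
% With \tau = 2\sqrt{10} and \delta_n \le 1, the truncation term is then at most
% \sqrt{10}\,\delta_n^2 / \tau = \delta_n^2 / 2, matching the requirement for (50).
```

The same cancellation of odd-order cross terms is what reduces the general expansion over $U$ to fourth moments plus products of second moments.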

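Establishing the bound (51): For a fixed subset $U$, recall the definition (4) of the function
class $\mathcal{H}(U)$. Note that the function class $2\mathcal{F}$ can be written as $\cup_{|U| = 2s} \mathcal{H}(U)$, where the co-ordinate
functions $g_j$ satisfy the bound $\|g_j\|_\infty \le 2$. Accordingly, we define the random variable

$$Z_n(U) := \sup_{\substack{g \in \mathcal{H}(U) \\ \|g\|_2 \le \delta_n}} \Big| \frac{1}{n} \sum_{i=1}^n \phi_\tau(g(x_i)) - \mathbb{E}[\phi_\tau(g(x))] \Big|, \tag{53}$$

and claim that it suffices to show that

$$\mathbb{P}\Big[ Z_n(U) \ge \frac{1}{16} \big( \delta_n^2 + t \delta_n + t^2 \big) \Big] \le c_1 \exp(-c_2 n t^2) \quad \text{for all } t \ge 0. \tag{54}$$

Indeed, assuming that this bound holds, then by applying the union bound over all $\binom{d}{2s}$ subsets of
cardinality at most $2s$, we have

$$\mathbb{P}\Big[ \sup_{\substack{g \in 2\mathcal{F} \\ \|g\|_2 \le \delta_n}} \Big| \frac{1}{n} \sum_{i=1}^n \phi_\tau(g(x_i)) - \mathbb{E}[\phi_\tau(g(x))] \Big| \ge \frac{1}{16} \big( \delta_n^2 + t \delta_n + t^2 \big) \Big] \le c_1 \exp\Big\{ -c_2 n t^2 + \log \binom{d}{2s} \Big\}.$$

Setting $t = \delta_n$ and noting that our choice of $\delta_n$ ensures that $\frac{c_2}{2} n \delta_n^2 \ge \log \binom{d}{2s}$ yields the claim (51).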
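The price of the union bound over subsets is the logarithm of a binomial coefficient, which is of order $s \log(d/s)$. The sketch below compares the exact value with the standard upper bound $k \log(e d / k)$ on $\log \binom{d}{k}$; the values $d = 1000$ and $s = 5$ and the function names are illustrative assumptions, not quantities from the text.

```python
import math


def log_num_subsets(d, k):
    """Exact log of binom(d, k), the number of size-k subsets of {1, ..., d}."""
    return math.log(math.comb(d, k))


def log_subsets_upper(d, k):
    """Standard upper bound k * log(e * d / k) on log binom(d, k)."""
    return k * math.log(math.e * d / k)


# Hypothetical problem size: ambient dimension d and sparsity s, subsets of size 2s.
d, s = 1000, 5
exact = log_num_subsets(d, 2 * s)
upper = log_subsets_upper(d, 2 * s)
```

Because the exponent $c_2 n t^2$ only has to dominate this logarithmic term, the union bound costs no more than a constant factor once $n \delta_n^2$ is of order $s \log(d/s)$.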

Accordingly, we now prove the bound (54). The functions $\phi_\tau(g(x))$ are uniformly bounded by
$\tau^2$. Moreover, since $\phi_\tau(u) = \min\{u^2, \tau^2\}$, we have

$$\mathbb{E}[\phi_\tau^2(g(x))] \le \tau^2\, \mathbb{E}[\phi_\tau(g(x))] \le \tau^2\, \mathbb{E}[g^2(x)] \le \tau^2 \delta_n^2,$$

where the final inequality uses the fact that $\mathbb{E}[g^2(x)] \le \delta_n^2$. Consequently, we have shown that
$\mathrm{var}(\phi_\tau(g(x))) \le \mathbb{E}[\phi_\tau^2(g(x))] \le \tau^2 \delta_n^2$. We now apply Corollary 7.9 in Ledoux [19] with $\epsilon = 1$,
$r = c_2 n t^2$ and $\sigma^2 = n \tau^2 \delta_n^2$ to conclude that

$$\mathbb{P}\Big[ Z_n(U) \ge 2\, \mathbb{E}[Z_n(U)] + \frac{1}{16} \big( t \delta_n + t^2 \big) \Big] \le c_1 \exp(-c_2 n t^2) \tag{55}$$

for some universal constants $(c_1, c_2)$. (In this step, we can choose $c_2$ small enough so as to obtain
the constants $1/16$.)

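Based on the bound (55), our remaining task is to show that $\mathbb{E}[Z_n(U)] \le \frac{1}{32} \delta_n^2$. By a standard
symmetrization argument, we have

$$\mathbb{E}[Z_n(U)] \le 2\, \mathbb{E}_{x,\sigma}\Big[ \sup_{\substack{g \in \mathcal{H}(U) \\ \|g\|_2 \le \delta_n}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i\, \phi_\tau(g(x_i)) \Big| \Big],$$

where $\{\sigma_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables. Since the function $\phi_\tau$ is Lipschitz with
constant $2\tau$, the Ledoux-Talagrand contraction inequality (p. 112, [20]) implies that

$$\mathbb{E}[Z_n(U)] \le 4\tau\, \mathbb{E}_{x,\sigma}\Big[ \sup_{\substack{g \in \mathcal{H}(U) \\ \|g\|_2 \le \delta_n}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i) \Big| \Big].$$

Note that $\mathcal{H}(U)$ is a subset of an RKHS with norm $\|g\|_{\mathcal{H}(U)}^2 = \sum_{j \in U} \|g_j\|_{\mathcal{H}}^2$. Since $\|g_j\|_{\mathcal{H}} \le 2$ for
each $g_j$, we have $\|g\|_{\mathcal{H}(U)} \le 4\sqrt{s}$ for all $g \in \mathcal{H}(U)$. Consequently, we have

$$\mathbb{E}[Z_n(U)] \le 4\tau\, \mathbb{E}_{x,\sigma}\Big[ \sup_{\substack{\|g\|_{\mathcal{H}(U)} \le 4\sqrt{s} \\ \|g\|_2 \le \delta_n}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i) \Big| \Big].$$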

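Now defining the rescaled functions $h = g/\sqrt{s}$, we have

$$\mathbb{E}[Z_n(U)] \le 4\tau \sqrt{s}\; \mathbb{E}_{x,\sigma}\Big[ \sup_{\substack{\|h\|_{\mathcal{H}(U)} \le 4 \\ \|h\|_2 \le \delta_n/\sqrt{s}}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i\, h(x_i) \Big| \Big] \le \underbrace{32\tau \sqrt{s}\, \frac{1}{\sqrt{n}} \Big[ \sum_{\ell=1}^{\infty} \min\Big\{ \frac{\delta_n^2}{s},\, \alpha_\ell \Big\} \Big]^{1/2}}_{T_n(\delta_n)},$$

where $\alpha_1 \ge \alpha_2 \ge \cdots$ are the eigenvalues of the kernel associated with the Hilbert space $\mathcal{H}(U)$.
This last inequality makes use of standard upper bounds on kernel Rademacher complexities (e.g.,
see Mendelson [24]).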

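Now since $\mathcal{H}(U)$ is a sum of at most $2s$ copies of the same univariate Hilbert space $\mathcal{H}$, the
eigenvalues $\{\alpha_\ell\}_{\ell=1}^{\infty}$ correspond to at most $2s$ copies of the eigenvalues $\{\mu_\ell\}_{\ell=1}^{\infty}$ of $\mathcal{H}$. Consequently,
by factoring out these $2s$ terms, we obtain

$$T_n(\delta_n) \le 64\tau\, s\, \frac{1}{\sqrt{n}} \Big[ \sum_{\ell=1}^{\infty} \min\Big\{ \frac{\delta_n^2}{s},\, \mu_\ell \Big\} \Big]^{1/2}.$$

Now, as long as $\delta_n^2/s \ge \nu_n^2$, where $\nu_n$ is the critical rate (12) for the univariate kernel, we are
guaranteed that

$$\frac{1}{\sqrt{n}} \Big[ \sum_{\ell=1}^{\infty} \min\Big\{ \frac{\delta_n^2}{s},\, \mu_\ell \Big\} \Big]^{1/2} \le \frac{1}{\kappa_0}\, \frac{\delta_n^2}{s},$$

and hence $\mathbb{E}[Z_n(U)] \le \frac{64\tau}{\kappa_0} \delta_n^2$. Choosing $\tau = 2\sqrt{10}$ and $\kappa_0 = 64^2 \sqrt{10}$, we conclude that
$\mathbb{E}[Z_n(U)] \le \frac{1}{32} \delta_n^2$, as required.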


E  Proof  of  Lemma  5 


Proof of part (a): Let $N = M(\frac{\delta}{\sqrt{s}}; \mathbb{B}_{\mathcal{H}}(1), \|\cdot\|_2) - 1$, and define $\mathcal{I} = \{0, 1, \ldots, N\}$. Using
$\|u\|_0 = \sum_{j=1}^d \mathbb{I}[u_j \ne 0]$ to denote the number of non-zero components in a vector, consider the set

$$\Theta := \{ u \in \mathcal{I}^d \mid \|u\|_0 = s \}. \tag{56}$$

Note that this set has cardinality $|\Theta| = \binom{d}{s} N^s$, since any element is defined by first choosing the $s$
co-ordinates that are non-zero, and then for each such co-ordinate, choosing a non-zero entry from a total of
$N$ possible symbols.

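For each $j = 1, \ldots, d$, let $\{0, f_j^1, f_j^2, \ldots, f_j^N\}$ be a $\delta/\sqrt{s}$-packing of $\mathbb{B}_{\mathcal{H}}(1)$. Based on these
packings of the univariate function classes, we can use $\Theta$ to index a collection of functions contained
inside $\mathcal{F}$. In particular, any $u \in \Theta$ uniquely defines a function $g^u = \sum_{j=1}^d g_j^{u_j} \in \mathcal{F}$, with elements

$$g_j^{u_j} = \begin{cases} f_j^{u_j} & \text{if } u_j \ne 0 \\ 0 & \text{otherwise.} \end{cases} \tag{57}$$

Since $\|u\|_0 = s$, we are guaranteed that at most $s$ co-ordinates of $g^u$ are non-zero, so that $g^u \in \mathcal{F}$.
Now consider two functions $g^u$ and $h^v$ contained within the class $\{g^u, u \in \Theta\}$. By definition,
we have

$$\|g^u - h^v\|_2^2 \ge \frac{\delta^2}{s} \sum_{j=1}^d \mathbb{I}[u_j \ne v_j]. \tag{58}$$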


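Consequently, it suffices to establish the existence of a “large” subset $\mathcal{A} \subset \Theta$ such that the
Hamming metric $\rho_H(u, v) := \sum_{j=1}^d \mathbb{I}[u_j \ne v_j]$ is at least $s/2$ for all pairs $u, v \in \mathcal{A}$, in which case we
are guaranteed that $\|g^u - h^v\|_2^2 \ge \delta^2/2$. For any $u \in \Theta$, we observe that

$$\Big| \Big\{ v \in \Theta \;\Big|\; \rho_H(u, v) \le \frac{s}{2} \Big\} \Big| \le \binom{d}{s/2} (N + 1)^{s/2}.$$

This bound follows because we simply need to choose a subset of size $s/2$ outside of which $u$ and $v$ agree, and
the $s/2$ co-ordinates in this subset can be chosen arbitrarily in $(N + 1)^{s/2}$ ways. For a given set $\mathcal{A}$, we
write $\rho_H(u, \mathcal{A}) \le \frac{s}{2}$ if there exists some $v \in \mathcal{A}$ such that $\rho_H(u, v) \le \frac{s}{2}$. Using this notation, we have

$$\Big| \Big\{ u \in \Theta \;\Big|\; \rho_H(u, \mathcal{A}) \le \frac{s}{2} \Big\} \Big| \le |\mathcal{A}| \binom{d}{s/2} (N + 1)^{s/2} \overset{(a)}{<} |\Theta|,$$

where inequality (a) follows as long as

$$|\mathcal{A}| \le N^* := \frac{1}{2}\, \frac{\binom{d}{s} N^s}{\binom{d}{s/2} (N + 1)^{s/2}}.$$

Thus, as long as $|\mathcal{A}| \le N^*$, there must exist some element $u \in \Theta$ such that $\rho_H(u, \mathcal{A}) > \frac{s}{2}$, in which
case we can form the augmented set $\mathcal{A} \cup \{u\}$. Iterating this procedure, we can form a set with $N^*$
elements such that $\rho_H(u, v) \ge \frac{s}{2}$ for all $u, v \in \mathcal{A}$.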

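Finally, we lower bound $N^*$. We have

$$N^* \overset{(i)}{\ge} \frac{1}{2} \Big( \frac{d - s}{s/2} \Big)^{s/2} \frac{N^s}{(N + 1)^{s/2}} = \frac{1}{2} \Big( \frac{d - s}{s/2} \Big)^{s/2} N^{s/2} \Big( \frac{N}{N + 1} \Big)^{s/2} \ge \frac{1}{2} \Big( \frac{d - s}{s} \Big)^{s/2} N^{s/2},$$

where inequality (i) follows by elementary combinatorics (see Lemma 5 in the paper [27] for details),
and the final step uses $N/(N + 1) \ge 1/2$.
We conclude that for $s \le d/4$, we have

$$\log N^* = \Omega\Big( s \log \frac{d}{s} + s \log M\big( \tfrac{\delta}{\sqrt{s}}; \mathbb{B}_{\mathcal{H}}(1), \|\cdot\|_2 \big) \Big),$$

thereby completing the proof of Lemma 5(a).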
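The iterative construction in the proof of part (a) is a greedy packing argument in the Hamming metric. The sketch below carries it out by brute force for a tiny instance. It is an illustration only: the values $d = 6$, $s = 4$, $N = 2$ and the function names are hypothetical, and the enumeration is feasible only for very small problems.

```python
import itertools
import math


def sparse_vectors(d, s, N):
    """Enumerate Theta: vectors in {0, ..., N}^d with exactly s non-zero entries."""
    vectors = []
    for support in itertools.combinations(range(d), s):
        for values in itertools.product(range(1, N + 1), repeat=s):
            u = [0] * d
            for idx, val in zip(support, values):
                u[idx] = val
            vectors.append(tuple(u))
    return vectors


def hamming(u, v):
    """Hamming distance: number of co-ordinates where u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def greedy_packing(theta, min_dist):
    """Keep each vector whose distance to every previously kept vector is >= min_dist."""
    kept = []
    for u in theta:
        if all(hamming(u, v) >= min_dist for v in kept):
            kept.append(u)
    return kept


# Hypothetical tiny instance of the construction.
d, s, N = 6, 4, 2
theta = sparse_vectors(d, s, N)
packing = greedy_packing(theta, s // 2)

# Size of a Hamming ball of radius s/2 is at most binom(d, s/2) * (N + 1)**(s/2).
ball_bound = math.comb(d, s // 2) * (N + 1) ** (s // 2)
```

Since every vector that the greedy pass rejects lies within distance $s/2$ of some kept vector, the kept set times the ball-size bound must cover $\Theta$, which is the counting step behind the definition of $N^*$.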


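Proof of part (b): In order to prove part (b), we instead let $N = M(\frac{1}{2}; \mathbb{B}_{\mathcal{H}}(1), \|\cdot\|_2) - 1$, and
then follow the same steps. Since $\log N = \Omega(m)$, we have the modified lower bound

$$\log N^* = \Omega\Big( s \log \frac{d}{s} + s\, m \Big).$$

Moreover, instead of the lower bound (58), we have

$$\|g^u - h^v\|_2^2 = \sum_{j=1}^d \|f_j^{u_j} - f_j^{v_j}\|_2^2 \ge \frac{1}{4} \sum_{j=1}^d \mathbb{I}[u_j \ne v_j] \ge \frac{s}{8}, \tag{59}$$

using our previous result on the Hamming separation. Furthermore, since $\|f_j\|_2 \le \|f_j\|_{\mathcal{H}}$ for any
univariate function, we have the upper bound

$$\|g^u - h^v\|_2^2 = \sum_{j=1}^d \|f_j^{u_j} - f_j^{v_j}\|_2^2 \le \sum_{j=1}^d \|f_j^{u_j} - f_j^{v_j}\|_{\mathcal{H}}^2.$$

By the definition (56) of $\Theta$, at most $2s$ of the terms $f_j^{u_j} - f_j^{v_j}$ can be non-zero. Moreover, by
construction we have $\|f_j^{u_j} - f_j^{v_j}\|_{\mathcal{H}} \le 2$, and hence

$$\|g^u - h^v\|_2^2 \le 8s.$$

Finally, by rescaling the functions by $\sqrt{8}\, \delta / \sqrt{s}$, we obtain a class of $N^*$ rescaled functions $\{\tilde{g}^u, u \in \mathcal{A}\}$
such that

$$\|\tilde{g}^u - \tilde{h}^v\|_2^2 \ge \delta^2, \quad \text{and} \quad \|\tilde{g}^u - \tilde{h}^v\|_2^2 \le 64\, \delta^2,$$

as claimed.

F Results on kernel classes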


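In this appendix, we collect some basic results about reproducing kernel Hilbert spaces, useful in
our analysis. Let $\mathcal{H}$ be an RKHS of functions $f : \mathcal{X} \to \mathbb{R}$. Let $\{\sigma_i\}_{i=1}^n$ be an i.i.d. sequence of
Rademacher variables, and let $\{x_i\}_{i=1}^n$ be an i.i.d. sequence of variables from $\mathcal{X}$, drawn according
to some distribution $\mathbb{Q}$. For each $t > 0$, we define the local Rademacher complexities

$$\widehat{Q}_n(t) := \mathbb{E}_\sigma\Big[ \sup_{\substack{\|g\|_n \le t \\ \|g\|_{\mathcal{H}} \le 1}} \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i) \Big], \quad \text{and} \quad Q_n(t) := \mathbb{E}_{x,\sigma}\Big[ \sup_{\substack{\|g\|_2 \le t \\ \|g\|_{\mathcal{H}} \le 1}} \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i) \Big]. \tag{60}$$

By results due to Mendelson [24], there are universal constants $c_\ell \le c_u$ such that for all $t^2 \ge 1/n$,
we have

$$\frac{c_\ell}{\sqrt{n}} \Big[ \sum_{\ell=1}^{\infty} \min\{t^2, \mu_\ell\} \Big]^{1/2} \le Q_n(t) \le \frac{c_u}{\sqrt{n}} \Big[ \sum_{\ell=1}^{\infty} \min\{t^2, \mu_\ell\} \Big]^{1/2}. \tag{61}$$

Conditionally on $\{x_i\}_{i=1}^n$, the same bounds hold for $\widehat{Q}_n(t)$ with the population eigenvalues $\{\mu_\ell\}_{\ell=1}^{\infty}$
replaced by the eigenvalues $\{\hat{\mu}_\ell\}_{\ell=1}^n$ of the kernel matrix defined by the $n$ samples. We let $\epsilon_n$ and
$\hat{\epsilon}_n$ denote (respectively) the smallest solutions (of size at least $1/\sqrt{n}$) to the inequalities

$$Q_n(\epsilon_n) \le \frac{\epsilon_n^2}{256}, \quad \text{and} \quad \widehat{Q}_n(\hat{\epsilon}_n) \le 256\, \hat{\epsilon}_n^2. \tag{62}$$

These two quantities correspond to the critical rates derived from the population and empirical
eigenvalues respectively. (Our scaling by 256 is for later theoretical convenience.)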
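Critical rates of the kind defined through local Rademacher complexities can be approximated numerically from the eigenvalue form of the Mendelson bound. The sketch below solves the fixed-point inequality by bisection. It is an illustration under stated assumptions: the universal constants are dropped, the scaling constant `c` is a free parameter (set to 1 here rather than the value used in the text), and the finite-rank spectrum with four unit eigenvalues is a hypothetical example.

```python
import math


def kernel_complexity(t, eigenvalues, n):
    """Eigenvalue proxy (1/sqrt(n)) * sqrt(sum_l min(t^2, mu_l)) for Q_n(t)."""
    return math.sqrt(sum(min(t * t, mu) for mu in eigenvalues) / n)


def critical_radius(eigenvalues, n, c=1.0, lo=1e-9, hi=10.0, iters=200):
    """Bisection for the smallest t in [lo, hi] with kernel_complexity(t) <= c * t^2.

    The ratio kernel_complexity(t) / t is non-increasing in t, so the feasible
    set is an interval [t*, infinity); this assumes lo is infeasible and hi is
    feasible, which holds for the example below.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kernel_complexity(mid, eigenvalues, n) <= c * mid * mid:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical finite-rank kernel: m = 4 unit eigenvalues, n = 100 samples.
# For t <= 1 the proxy equals t * sqrt(m / n), so the critical radius is sqrt(m / n) / c.
radius = critical_radius([1.0] * 4, n=100)
```

For this finite-rank example the computed radius matches the familiar $\sqrt{m/n}$ scaling; for a decaying spectrum one would pass a truncated list of eigenvalues instead.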


Our first result relates the critical rates based on the population and empirical eigenvalues. Recall
that $\gamma_n := \kappa_1 \max\{\nu_n, \sqrt{\log d / n}\}$.

Lemma 6. We have

$$\mathbb{P}[\hat{\epsilon}_n \le \gamma_n] \ge 1 - c_1 \exp(-c_2 n \gamma_n^2). \tag{63}$$

This result is exploited at the start of Appendix A. In particular, combined with union bound, it
implies that the event $\mathcal{T}(\gamma_n)$ holds with high probability, as claimed.

Our second result provides uniform control on the difference between the empirical $\|\cdot\|_n$ and
population $\|\cdot\|_2$ norms over $\mathcal{H}$. In particular, for a radius $t \ge \epsilon_n$, we define the event

$$\mathcal{E}(t) := \Big\{ \sup_{\substack{g \in \mathbb{B}_{\mathcal{H}}(1) \\ \|g\|_2 \le t}} \big| \|g\|_n - \|g\|_2 \big| \ge \frac{t}{2} \Big\}. \tag{64}$$

Lemma 7. Suppose that $\|g\|_\infty \le 1$ for all $g \in \mathbb{B}_{\mathcal{H}}(1)$. Then there exist universal constants $(c_1, c_2)$
such that for any $t \ge \epsilon_n$,

$$\mathbb{P}[\mathcal{E}(t)] \le c_1 \exp(-c_2 n t^2). \tag{65}$$

Moreover, for any $t \ge \epsilon_n$, we have

$$\frac{1}{2} \|g\|_2 \le \|g\|_n \le \frac{3}{2} \|g\|_2 \quad \text{for all } g \in \mathbb{B}_{\mathcal{H}}(1) \text{ with } \|g\|_2 \ge t \tag{66}$$

with probability at least $1 - c_1 \exp(-c_2 n t^2)$.

F.1 Proof of Lemma 7

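Our proof is based on the random variable

$$Y_n(t) := \sup_{g \in \mathbb{B}_{\mathcal{H}}(1),\; \|g\|_2 \le t} \big| \|g\|_n^2 - \|g\|_2^2 \big|.$$

If the event $\mathcal{E}(t)$ occurs, then there exists some $g \in \mathbb{B}_{\mathcal{H}}(1)$ such that $\big| \|g\|_n - \|g\|_2 \big| \ge \frac{t}{2}$, whence

$$\big| \|g\|_n^2 - \|g\|_2^2 \big| = \big| \|g\|_n - \|g\|_2 \big| \, \big( \|g\|_n + \|g\|_2 \big) \ge \frac{t^2}{4}.$$

Therefore, it suffices to establish the upper bound

$$\mathbb{P}\Big[ Y_n(t) \ge \frac{t^2}{4} \Big] \le c_1 \exp(-c_2 n t^2).$$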

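We first bound deviations above the expectation using concentration theorems for empirical processes [19].
The supremum of the variances is upper bounded by

$$\gamma^2(t) := \sup_{g \in \mathbb{B}_{\mathcal{H}}(1),\; \|g\|_2 \le t} \frac{1}{n} \sum_{i=1}^n \mathrm{var}\big( g^2(x_i) - \|g\|_2^2 \big) = \sup_{g \in \mathbb{B}_{\mathcal{H}}(1),\; \|g\|_2 \le t} \mathbb{E}\big[ \big( g^2(x) - \|g\|_2^2 \big)^2 \big],$$

using the i.i.d. nature of the samples $\{x_i\}_{i=1}^n$. Moreover, since the functions are uniformly bounded
by 1, we have

$$\gamma^2(t) \le 32 \sup_{g \in \mathbb{B}_{\mathcal{H}}(1),\; \|g\|_2 \le t} \mathbb{E}\big[ g^2(x) + \|g\|_2^2 \big] \le 64\, t^2, \tag{67}$$

where the final inequality uses the fact that $\mathbb{E}[g^2(x)] = \|g\|_2^2 \le t^2$. Consequently, applying Corollary
7.9 in Ledoux [19] with $\epsilon = 1$, $r = n t^2$ and $\sigma^2 = 64 t^2$, we conclude that there are universal constants
such that

$$\mathbb{P}\Big[ Y_n(t) \ge 2\, \mathbb{E}[Y_n(t)] + \frac{t^2}{20} \Big] \le c_1 \exp(-c_2 n t^2). \tag{68}$$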

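We now upper bound the mean. By a standard symmetrization argument, we have

$$\mathbb{E}[Y_n(t)] \le 2\, \mathbb{E}_{x,\sigma}\Big[ \sup_{g \in \mathbb{B}_{\mathcal{H}}(1),\; \|g\|_2 \le t} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i\, g^2(x_i) \Big| \Big],$$

where $\{\sigma_i\}_{i=1}^n$ are i.i.d. Rademacher variables. Since $\|g\|_\infty \le 1$ for all $g \in \mathbb{B}_{\mathcal{H}}(1)$, we may apply
the Ledoux-Talagrand contraction theorem ([20], p. 112) to obtain that

$$\mathbb{E}[Y_n(t)] \le 8\, \mathbb{E}_{x,\sigma}\Big[ \sup_{g \in \mathbb{B}_{\mathcal{H}}(1),\; \|g\|_2 \le t} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i) \Big| \Big] = 8\, Q_n(t).$$

But by our choice (62) of $\epsilon_n$ and since $t \ge \epsilon_n$, we have $Q_n(t) \le \frac{t^2}{256}$. Combined with the
bound (68), we conclude that

$$Y_n(t) \le \frac{16\, t^2}{256} + \frac{t^2}{20} \le \frac{t^2}{4}$$

with probability at least $1 - c_1 \exp(-c_2 n t^2)$, as claimed.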

Finally, let us prove the sandwich relation (66). For any $g \in \mathbb{B}_{\mathcal{H}}(1)$ with $\|g\|_2 \ge t \ge \epsilon_n$, we can
define the function $h := \frac{t}{\|g\|_2}\, g$. Note that $h \in \mathbb{B}_{\mathcal{H}}(1)$ and $\|h\|_2 = t$, so that when the bound (65)
holds, we have $\|h\|_2 - \frac{t}{2} \le \|h\|_n \le \|h\|_2 + \frac{t}{2}$, or equivalently, that

$$\frac{t}{2} \le \frac{t}{\|g\|_2}\, \|g\|_n \le \frac{3t}{2},$$

with probability at least $1 - c_1 \exp(-c_2 n t^2)$, which establishes the claim (66).


F.2  Proof  of  Lemma  6 


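For any $t > 0$, define the two random variables

$$\widehat{Z}_n(t) := \sup_{\substack{g \in \mathbb{B}_{\mathcal{H}}(1) \\ \|g\|_n \le t}} \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i), \quad \text{and} \quad Z_n(t) := \sup_{\substack{g \in \mathbb{B}_{\mathcal{H}}(1) \\ \|g\|_2 \le t}} \frac{1}{n} \sum_{i=1}^n \sigma_i\, g(x_i),$$

and observe that $\mathbb{E}_\sigma[\widehat{Z}_n(t)] = \widehat{Q}_n(t)$ and $\mathbb{E}_{x,\sigma}[Z_n(t)] = Q_n(t)$.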

For any function with $\|g\|_n \le t$, we have $\sum_{i=1}^n \mathrm{var}_\sigma(\sigma_i g(x_i)) = n \|g\|_n^2 \le n t^2$. Consequently,
applying the lower bound in Corollary 7.9 of Ledoux [19] with $r = c_2 n t^2$ and $\epsilon = 1/2$, we obtain

$$\mathbb{P}\Big[ \widehat{Z}_n(t) \le \frac{1}{2} \widehat{Q}_n(t) - t^2 \Big] \le c_1 \exp(-c_2 n t^2). \tag{69}$$

Similarly, for any function with $\|g\|_2 \le t$, we have $\sum_{i=1}^n \mathrm{var}_{\sigma,x}(\sigma_i g(x_i)) = n \|g\|_2^2 \le n t^2$. Consequently,
applying the upper bound in Corollary 7.9 of Ledoux [19] with $r = c_2 n t^2$ and $\epsilon = 1$, we
obtain

$$\mathbb{P}\big[ Z_n(t) \ge 2\, Q_n(t) + t^2 \big] \le c_1 \exp(-c_2 n t^2). \tag{70}$$

Now suppose that $\|g\|_2 > t \ge \epsilon_n$. Then conditioned on the sandwich relation (66), we are
guaranteed that $\|g\|_n > \frac{t}{2}$. Taking the contrapositive, we conclude that $\|g\|_n \le \frac{t}{2}$ implies $\|g\|_2 \le t$,
and hence that

$$\widehat{Z}_n(t/2) \le Z_n(t) \quad \text{for all } t \ge \epsilon_n, \tag{71}$$

under the stated conditioning.

For any $t \ge \epsilon_n$, the inequalities (69), (70) and (71) hold with probability at least $1 - c_1 \exp(-c_2 n t^2)$.
Conditioning on these inequalities, we can set $t = \gamma_n \ge \epsilon_n$, and thereby obtain

$$\widehat{Q}_n(\gamma_n) \overset{(a)}{\le} 2\, \widehat{Z}_n(\gamma_n) + 2 \gamma_n^2 \overset{(b)}{\le} 2\, Z_n(2\gamma_n) + 2 \gamma_n^2 \overset{(c)}{\le} 4\, Q_n(2\gamma_n) + 10\, \gamma_n^2 \overset{(d)}{\le} 128\, \gamma_n^2,$$

where inequality (a) follows from the bound (69), inequality (b) follows from the bound (71),
inequality (c) follows from the bound (70) applied with $t = 2\gamma_n$, and inequality (d) follows since $2\gamma_n \ge \epsilon_n$ and the
definition of $\epsilon_n$. By the definition of $\hat{\epsilon}_n$ as the minimal $t$ such that $\widehat{Q}_n(t) \le 256\, t^2$, we conclude that
$\hat{\epsilon}_n \le \gamma_n$, as claimed.